ABSTRACT


Statistical  databases  provide  statistical  information 
to  user  queries.  The  security  problem  for  a  statistical 
database  is  to  limit  the  use  of  the  database  so  that  no 
sequence  of  queries  is  sufficient  to  deduce  confidential 
information about any individual. In this thesis, the
security problem of statistical databases is investigated in
the context of a statistical database (SDB) design. New
results involving a comprehensive secure SDB design are
described.

A  partitioning  model  of  the  SDB  is  discussed  in  order 
to  be  used  as  a  tool  in  the  SDB  design.  Primitive  change 
operations  are  allowed  in  the  model,  and  the  conditions  are 
derived  to  prevent  compromise.  Variations  of  the 
partitioning  model  which  use  either  rounding,  or  data 
perturbation  or  both  are  introduced  to  remove  some  of  the 
assumptions made in the partitioning model.

The importance of the semantic meaningfulness of users'
queries  is  stressed.  It  is  argued  that  it  will  enhance  the 
security  by  not  permitting  malicious  users  to  form 
meaningless  queries  in  order  to  use  their  responses  in 
combinatorial  formulas  for  compromise.  Within  the  context  of 
a  formal  framework,  an  SDB  design  using  security  constraints 
at the conceptual data model level is proposed. Three
different structural, semantic and redundant data models are
investigated and the D-A model [Smith and Smith, 1977] is
chosen as the conceptual data model of the SDB. The population
concept is utilized to identify semantically well-defined
objects about which statistical information is revealed to
users. For this purpose, the Population Definition Construct
(PDC) is introduced for each population in the conceptual
model.

It  is  argued  that,  for  complete  protection,  users' 
additional  knowledge  should  be  maintained  and  kept 
up-to-date.  Users'  additional  knowledge  may  take  the  form  of 
general  rules  and  explicit  facts.  The  SDB  design  proposed 
herein  maintains  only  the  users'  knowledge  of  protected 
property  values  of  individuals  in  the  SDB  using  the  User 
Knowledge  Construct  (UKC). 

In  order  to  keep  the  PDCs  and  UKCs  up-to-date,  to 
enforce  the  security  constraints  and  to  help  the  DBA  in 
security-related  decision  problems,  the  constraint  enforcer 
and  checker  (CEC)  is  proposed.  The  CEC,  UKCs  and  PDCs 
comprise  the  Statistical  Security  Management  Facility 
(SSMF).  Implementation  issues  of  the  SSMF  are  briefly 
discussed.

Different  types  of  inferences  by  users  are  identified, 
and  possible  security  constraints  for  different  types  of 
statistical  queries  are  investigated.  It  is  demonstrated 
that, usually, simple security constraints can be defined to
protect the SDB from compromise.

Extensions to the SDB design are described, which
include a Question-Answering System, a security kernel and
a set of security-related high level commands for handling
the changes in the SDB.

CHAPTER 1


INTRODUCTION 


1.1 Computer Security

Computer security has always been an important issue.
However, due to the widespread usage of computers and large
quantities of shared data, the issue has gained far greater
importance.

Parker [Parker, 1976] reports that the median and the
total known loss in the cases of computer abuse were around
$500,000 and $100 million annually, respectively. These figures are
expected to rise [Denning and Denning, 1979] unless
countermeasures are taken. The goal of computer security
research is to devise safeguards to prevent possible
computer abuse.

Definitions  of  Security,  Privacy  and  Confidentiality 
are presented below [ACM, 1974]:

Security. Data security is the protection of data against
accidental or intentional destruction, disclosure, or
modification. Computer security refers to the technological
safeguards and managerial procedures which can be applied to
computer hardware, programs, and data to assure that
organizational assets and individual privacy are protected.

Privacy.  Privacy  is  a  concept  which  applies  to  an 
individual.  It  is  the  right  of  an  individual  to  decide  what 
information  (s)he  wishes  to  share  with  others  and  also  what 
information  (s)he  is  willing  to  accept  from  others. 

Confidentiality.  Confidentiality  is  a  concept  which  applies 
to  data.  It  is  the  status  accorded  to  data  which  has  been 
agreed  upon  between  the  person  or  organization  furnishing 
the  data  and  the  organization  receiving  it  and  which 
describes  the  degree  of  protection  to  be  provided. 

Security  mechanisms  may  be  classified  into  two  groups 
[Denning  and  Denning,  1979].  Internal  security  mechanisms 
control  the  operation  of  the  computer  system  in  four  areas: 
access  control  to  stored  objects,  information  flow  control 
between  stored  objects,  encryption  of  confidential  data 
transmitted  on  communications  lines,  and  inference  control 
of  confidential  data  stored  in  statistical  databases 
[Denning and Denning, 1979; Hsiao et al., 1978; Madnick,
1979]. External security mechanisms control operations
outside the main computing system; examples are fire
protection, personnel screening, etc. [Madnick, 1979;
Shankar, 1977; Nielsen et al., 1976].

This  thesis  is  concerned  with  the  inference  control  of 
confidential data stored in statistical databases.

1.2 Statistical Database Security

A  statistical  database  (SDB)  has  been  defined  as  one 
which  returns  statistical  information,  such  as  frequency 
counts  of  records  satisfying  some  given  criteria,  as  opposed 
to  a  database  which  returns  complete  details  of  a  record, 
for  example  name  and  address  of  an  employee.  Such 
statistical  databases  have  wide  applicability  in  medical 
research,  health  planning  and  political  planning,  to  name 
just  a  few. 

The  security  problem  for  a  statistical  database  is  to 
limit  its  use  so  that  only  statistical  information  is 
available  and  no  sequence  of  queries  is  sufficient  to  derive 
confidential information about any individual. When such
information is obtained, the database is said to be
compromised (or disclosure has occurred). (In this thesis, the
terms database and statistical database are used
interchangeably.) Notice that the protected information does
not necessarily reside in the database.

In order to clarify the nature of the security problem,
three examples are now presented. First consider an on-line
system that gives information about the number of
individuals having certain properties (i.e., COUNT
information is revealed). The system tells the user that a
total of three people have the following properties: age 39,
male, married, live in Edmonton and are lawyers. Suppose
further that the user knows a particular lawyer having all
of the above characteristics and that his data is included in the
database. Now if the user enquires about the number of people
having all of the above properties and, in addition,
earnings over $50,000 a year, and gets an answer of 3, then
the user knows immediately that this particular lawyer
earns over $50,000 a year.
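
The inference in this first example can be sketched in a few lines of Python. This is only an illustration: the records, the field names and the `count` helper are hypothetical and do not come from any system described in the thesis. The sketch shows that when two COUNT queries over nested query sets return the same value, every individual known to be in the larger set must also satisfy the extra condition.

```python
# Hypothetical toy records (assumption: all names and values are invented).
people = [
    {"age": 39, "sex": "M", "married": True, "city": "Edmonton", "job": "lawyer", "income": 62000},
    {"age": 39, "sex": "M", "married": True, "city": "Edmonton", "job": "lawyer", "income": 58000},
    {"age": 39, "sex": "M", "married": True, "city": "Edmonton", "job": "lawyer", "income": 71000},
    {"age": 45, "sex": "F", "married": False, "city": "Edmonton", "job": "doctor", "income": 40000},
]


def count(records, predicate):
    """COUNT query: number of records satisfying the predicate."""
    return sum(1 for r in records if predicate(r))


def profile(r):
    """Properties the user already knows about the target lawyer."""
    return (r["age"] == 39 and r["sex"] == "M" and r["married"]
            and r["city"] == "Edmonton" and r["job"] == "lawyer")


n_profile = count(people, profile)
n_profile_rich = count(people, lambda r: profile(r) and r["income"] > 50000)

# Equal counts mean nobody matching the profile earns $50,000 or less,
# so the known lawyer must earn more than $50,000.
target_earns_over_50k = n_profile == n_profile_rich
```

Neither query returns an individual record, yet the pair of answers discloses a protected fact about one person.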


As  another  example,  consider  an  off-line  system,  e.g., 
a  census  publication  office,  that  publishes  tables  of 
statistical  information.  Suppose  a  small  county  has  six 
hardware  stores  and  a  city  within  the  county  has  four  of 
them.  If  retail  sales  are  published  for  the  county  and  for 
the  city  then  each  of  the  two  out-of-town  stores  can 
determine  the  other's  sales  simply  by  taking  differences 
between  the  published  county  and  city  figures. 
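
The arithmetic behind this second example can be written out as a short Python sketch. The sales figures and store names are invented for illustration; only the structure (six stores in the county, four of them in the city) follows the text.

```python
# Hypothetical annual sales in thousands of dollars (assumption: invented data).
city_store_sales = [120, 95, 210, 160]            # the four stores in the city
out_of_town_sales = {"store_A": 80, "store_B": 55}  # the two remaining stores

# The two published figures.
county_total = sum(city_store_sales) + sum(out_of_town_sales.values())
city_total = sum(city_store_sales)


def infer_competitor_sales(own_sales, county_total, city_total):
    """What an out-of-town store learns about the other out-of-town store."""
    return county_total - city_total - own_sales


inferred_b = infer_competitor_sales(out_of_town_sales["store_A"], county_total, city_total)
inferred_a = infer_competitor_sales(out_of_town_sales["store_B"], county_total, city_total)
```

Each out-of-town store only needs its own sales figure, which it obviously knows, to recover the other's exactly.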


For  the  third  example  consider  an  on-line  database  of 
employees  of  a  company,  in  which  salary  ranges  for  employees 
are  not  protected.  It  is  also  known  that  every  electrical 
engineer  with  salary  <  $20,000,  at  least  5  years  working 
experience  and  a  B.S.  degree  has  had  at  least  one  "bad" 
rating  from  his  manager.  Suppose  the  user  knows  an 
electrical  engineer  with  a  B.S.  degree  and  experience  for 
more  than  5  years  in  the  company.  If  both  queries,  "number 
of  electrical  engineers  with  B.S.  degree  and  at  least  5 
years experience" and "number of electrical engineers with
B.S. degree, at least 5 years experience and salary <
$20,000" have the same answer, then that particular
electrical engineer has had at least one "bad" rating from
his manager.

The first two examples illustrate two different kinds
of applications, an on-line and an off-line application, while
the third example illustrates the fact that the information
deduced by the user may not be stored in the database. For
the off-line application, statistical offices traditionally
examine their publications carefully to ensure that there is
no disclosure. However, increasing demand for detailed
information and the possible use of computers for correlating
several publications to disclose further information have
prompted researchers to consider stricter security
measures, such as data perturbation techniques [Hansen,
1971; Nargundkar and Saveland, 1972; Fellegi and Phillips,
1972]. In the case of on-line databases, instead of storing
aggregate information, the database contains anonymous but
individual records, and returns statistical summaries of
those records which satisfy the specific characteristics
given in the query. Changes to the database, such as
insertions, deletions and updates, are allowed and responses
to queries are expected to reflect the current status of the
database.

1.3 Overview and Outline of the Thesis

In  Chapter  2,  an  overview  of  the  previous  studies  on 
SDB  security  is  provided.  Using  a  set  of  "goodness" 
criteria,  proposed  protection  policies  are  discussed  and 
evaluated.

A  partitioning  model  for  dynamic  SDBs  is  investigated 
in  Chapter  3.  The  information  revealed  to  the  users  during 
the  insertions,  deletions  and  updates  is  characterized  and 
it  is  shown  that,  under  certain  conditions,  the  model  is 
secure.  Data  perturbation  and  rounding  are  proposed  to 
remove  some  of  the  disadvantages  of  the  model.  Some  security 
measures  which  help  the  DBA  to  assess  how  secure  the  SDB  is 
at  a  certain  time  are  defined. 

Using  a  formal  framework  in  Chapter  4,  the  design  of  an 
SDB  which  employs  security  constraints  at  the  conceptual 
data  model  level  is  investigated.  Three  redundant, 
structured  and  semantic  data  models  are  analyzed  for  their 
suitability  as  a  conceptual  model  of  the  SDB.  In  the  SDB, 
information  revealed  to  users  is  well-defined  in  the  sense 
that  it  can  at  most  be  reduced  to  indivisible  information 
involving  a  group  of  individuals.  Any  information  involving 
few  individuals  (and  therefore  risking  compromise)  is 
recorded  and  kept  for  auditing  purposes. 

In Chapter 5, the design of the SDB is extended with a
Question-Answering System, a security kernel and a set of
security-related high level commands.

Finally, the overall significance of the results
and an overview of the motivation for the research are
outlined. Outstanding problems and areas that require
further investigation are indicated.

CHAPTER 2


SDB  MODELS  AND  PROTECTION  SCHEMES 

Proposed  protection  policies  for  SDB  models  are 
discussed  and  evaluated  using  a  set  of  "quality"  criteria. 

2.1  SDB  Models 

Below is an SDB model proposed by Denning [Denning,
1978; Denning et al., 1979].

A statistical database can be viewed as a set of n
records. Each record has k attribute (property) values
corresponding to attributes A_1, A_2, ..., A_k, among which some
are protected attributes. Values of the protected attributes
for each record are confidential and only statistical
summary information about these attributes is available. An
example is a database of employees. Each record has
attributes NAME, ADDRESS, SEX, AGE, POSITION, etc. and a
protected attribute SALARY.

A query is some statistical function, e.g. MEAN,
MEDIAN, etc., applied to some subset of the records in the
database. Every query has a characteristic expression C
which is a logical expression using the logical operators
conjunction (&), disjunction (v) and negation (~). The set
of records satisfying the characteristic expression C of a
query is called the query set, S(C). For example, in the
database of employees, C = ((AGE < 40) & (POSITION = programmer)) is
a characteristic expression and its query set S(C) contains
all the programmers under 40 years of age.

The most common statistical query types can be defined
as follows:

COUNT(C) = |S(C)|, the size of S(C)

SUM(C, A_i) = sum of the i-th attribute values of those records
in S(C), 1 ≤ i ≤ k

AVERAGE(C, A_i) = average of the i-th attribute values of those
records in S(C), 1 ≤ i ≤ k

MAX(C, A_i) = the maximum of the i-th attribute values of those
records in S(C), 1 ≤ i ≤ k

MIN(C, A_i) = the minimum of the i-th attribute values of those
records in S(C), 1 ≤ i ≤ k

MEDIAN(C, A_i) = the median of the i-th attribute values of
those records in S(C), 1 ≤ i ≤ k
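
The query types above can be made concrete with a minimal Python sketch. It assumes records are dictionaries and that a characteristic expression C is an ordinary Python predicate; the employee data and all function names are hypothetical and only mirror the notation of the model.

```python
from statistics import mean, median

# Hypothetical employee records (assumption: invented for illustration).
employees = [
    {"NAME": "A", "AGE": 31, "POSITION": "programmer", "SALARY": 30000},
    {"NAME": "B", "AGE": 38, "POSITION": "programmer", "SALARY": 34000},
    {"NAME": "C", "AGE": 45, "POSITION": "programmer", "SALARY": 41000},
    {"NAME": "D", "AGE": 29, "POSITION": "analyst", "SALARY": 28000},
    {"NAME": "E", "AGE": 35, "POSITION": "programmer", "SALARY": 32000},
]


def query_set(db, C):
    """S(C): the records satisfying the characteristic expression C."""
    return [r for r in db if C(r)]


def COUNT(db, C):
    return len(query_set(db, C))


def SUM(db, C, A):
    return sum(r[A] for r in query_set(db, C))


def AVERAGE(db, C, A):
    # Raises statistics.StatisticsError when the query set is empty.
    return mean(r[A] for r in query_set(db, C))


def MAX(db, C, A):
    return max(r[A] for r in query_set(db, C))


def MIN(db, C, A):
    return min(r[A] for r in query_set(db, C))


def MEDIAN(db, C, A):
    return median(r[A] for r in query_set(db, C))


def C(r):
    """C = ((AGE < 40) & (POSITION = programmer))."""
    return r["AGE"] < 40 and r["POSITION"] == "programmer"
```

With this data, S(C) holds the three programmers under 40, so `COUNT(employees, C)` is 3 and `SUM(employees, C, "SALARY")` is 96000.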

The database is compromisable if one can deduce from
the responses to the queries some protected attribute values
of records. Clearly this definition is less general than the
one given in the introduction, since an attribute that is not
confidential may nevertheless need to be protected if it leads
to the derivation of some confidential information not
necessarily in the database.

In some other SDB models key-specified queries are used
to describe query sets [Dobkin et al., 1979; DeMillo et al.,
1978; Reiss, 1979a]. Binary k-bit keys are used to describe
attribute values of records, and k-bit queries with 0's,
1's and *'s (don't care) are used to describe the query sets
[Kam and Ullman, 1977; Chin, 1978]. An m-response system
suppresses answers to queries with a query set size less than m.
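
A key-specified query and the m-response rule can be sketched as follows. The representation (keys and query patterns as strings over 0, 1 and *) and the function names are assumptions made for illustration; the cited papers do not prescribe an implementation.

```python
def matches(key, pattern):
    """True if a k-bit key fits a k-character pattern of 0's, 1's and *'s."""
    return len(key) == len(pattern) and all(
        p == "*" or p == bit for bit, p in zip(key, pattern)
    )


def m_response_sum(db, pattern, m):
    """SUM over the query set, or None when the set has fewer than m records."""
    values = [value for key, value in db if matches(key, pattern)]
    if len(values) < m:
        return None  # answer suppressed
    return sum(values)


# Hypothetical database of (3-bit key, protected value) pairs.
db = [("000", 10), ("001", 20), ("011", 30), ("111", 40)]

answered = m_response_sum(db, "0*1", m=2)    # query set: keys 001 and 011
suppressed = m_response_sum(db, "1**", m=2)  # query set: key 111 only
```

The pattern `0*1` selects two records and is answered, while `1**` selects a single record and is suppressed in a 2-response system.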

2.2  Protection  Schemes 

The  proposed  protection  schemes  may  be  classified  into 
the  following  six  categories: 

1)  controlling  the  size  of  the  query  set, 

2)  limiting  excessive  overlap  between  query  sets, 

3)  partitioning  the  database, 

4)  output  perturbation, 

5)  random  sampling, 

6)  data  distortion. 

In general, protection schemes impose restrictions on
the  system.  In  order  to  compare  the  "quality"  of  these 
schemes,  the  following  factors  will  be  considered: 

(a) Effectiveness: restrictions should guarantee
security to a reasonable extent. We will also discuss the
effectiveness of restrictions under dynamic databases and
users' knowledge about the real world that the database
models.

(b) Feasibility: it is possible that some restrictions
are sufficient to guarantee security but the system has no
way to enforce them. The enforcement of restrictions should
be feasible.

(c) Efficiency: the implementation should be efficient.
Any scheme which is virtually impossible to implement or
involves too much overhead should be avoided.

(d) Richness: restrictions when applied should not
conceal too much information. In other words, the database
should still be rich enough to be useful for users.

Protection by Controlling the Size of the Query Set:

One of the earliest and most straightforward protection
schemes is to suppress queries whose query set size is small
[Hoffman and Miller, 1970; Hansen, 1971]. The examples given
in Section 1.2 illustrate that the security of the database
is endangered by allowing answers to queries with small
query set size. In [Chin, 1978], an m-response system using
k-bit binary keys to describe characteristics of records is
introduced. It allows only SUM and COUNT queries and
prohibits answering those queries whose query sets have fewer
than m records. Necessary and sufficient conditions to
guarantee the security of a 2-response system are given for a
static database (i.e., no insertions, deletions and changes
in the database). Unfortunately this result imposes too many
restrictions on the system and limits the richness of the
database. Moreover, as illustrated by the second example in
Section 1.2, since any two users of a particular kind in a
2-response system might easily know each other, it is a
generally accepted practice to have an m-response system
with m ≥ 3. However, the properties of an m-response system for
m > 3 are not well understood.

It can be shown that information can also be deduced if
queries with a large query set size are answered [Denning et
al., 1979]. Thus the system should refuse queries with very
small or very large query sets. Unfortunately, this
protection scheme can easily be subverted by a device called
a tracker [Schlorer, 1975; Denning et al., 1979]. A tracker is
some auxiliary characteristic expression which, when added
to the original characteristic expression, produces an
answerable query. The user then uses this answer with some
others to deduce the answer for the original unanswerable
query. This idea is further extended to double trackers and
general trackers which are applied to more restricted ranges
of answerable query set sizes.
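
The tracker idea can be illustrated with a minimal Python sketch of one simple form, in which the target's description is split as C = C1 & C2 and the tracker is T = C1 & ~C2. The class, the records and the attribute names are hypothetical; the sketch assumes a system that answers a query only when its query set size lies between m and n - m.

```python
class SizeRestrictedDB:
    """Answers COUNT and SUM only when m <= |query set| <= n - m."""

    def __init__(self, records, m):
        self.records = records
        self.m = m

    def _query_set(self, predicate):
        qs = [r for r in self.records if predicate(r)]
        n = len(self.records)
        return qs if self.m <= len(qs) <= n - self.m else None

    def count(self, predicate):
        qs = self._query_set(predicate)
        return None if qs is None else len(qs)

    def total(self, predicate, attr):
        qs = self._query_set(predicate)
        return None if qs is None else sum(r[attr] for r in qs)


# Hypothetical records; salaries are in thousands of dollars.
records = [
    {"sex": "F", "dept": "CS", "salary": 50},  # the target: the only F in CS
    {"sex": "F", "dept": "EE", "salary": 40},
    {"sex": "F", "dept": "ME", "salary": 45},
    {"sex": "M", "dept": "CS", "salary": 60},
    {"sex": "M", "dept": "CS", "salary": 55},
    {"sex": "M", "dept": "EE", "salary": 42},
    {"sex": "M", "dept": "ME", "salary": 48},
    {"sex": "M", "dept": "EE", "salary": 52},
]
db = SizeRestrictedDB(records, m=2)


def c1(r):
    return r["sex"] == "F"


def c2(r):
    return r["dept"] == "CS"


def target(r):          # C = C1 & C2, query set size 1: unanswerable
    return c1(r) and c2(r)


def tracker(r):         # T = C1 & ~C2, query set size 2: answerable
    return c1(r) and not c2(r)


direct_answer = db.total(target, "salary")
inferred_count = db.count(c1) - db.count(tracker)
inferred_salary = db.total(c1, "salary") - db.total(tracker, "salary")
```

The direct query is refused, but two answerable queries subtract to the target's salary, which is why a size restriction alone only makes the intruder's job harder.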

For key-specified MEDIAN queries, the number of queries
to compromise the SDB is lower bounded by O(log_2 k) queries
and upper bounded by O((log_2 k)^2) queries [DeMillo and Dobkin,
1979; Reiss, 1979].

The  above  results  show  that  the  technique  of 
controlling  the  size  of  the  query  set  is  not  effective 
(although feasible) and merely makes the intruder's job
harder.

Protection by Limiting Excessive Overlap Between Query Sets:

The  protection  scheme  of  limiting  query  set  overlap 
assumes  a  static  database  of  n  records  and  a  fixed  query  set 
size  k,  and  inhibits  the  responses  to  queries  whose  query 
sets  have  too  much  overlap  (say,  more  than  r  records)  with 
the query sets of other answered queries. It is shown that,
for SUM statistical queries, the smallest number of queries
sufficient to compromise the SDB is lower bounded by
S = (2k - (t+1))/r, where t is the number of records whose protected
attribute values are known by the user [Reiss, 1979]. It is
also shown that this bound, S, is optimum for r = 1 and t = 0, 1
[Dobkin, 1979]. There are two problems with the protection
by  controlling  query  set  overlaps.  First,  it  may  not  be 
feasible  since  extensive  set  intersection  checks  are 
required.  Second,  since  a  previously  answered  query  may 
inhibit  the  responses  of  several  other  more  useful  queries, 
this  protection  scheme  may  severely  limit  the  richness  of 
the  database. 
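
A minimal Python sketch of an overlap monitor makes the feasibility concern visible. The class and its interface are assumptions made for illustration, not a mechanism taken from the cited papers; `lower_bound` simply evaluates the bound S = (2k - (t+1))/r as stated above.

```python
class OverlapMonitor:
    """Refuses a query whose set shares more than r records with any answered one."""

    def __init__(self, k, r):
        self.k = k            # fixed query set size
        self.r = r            # maximum permitted overlap
        self.answered = []    # every answered query set must be remembered

    def allow(self, query_set):
        qs = frozenset(query_set)
        if len(qs) != self.k:
            return False
        # One set intersection per previously answered query.
        if any(len(qs & old) > self.r for old in self.answered):
            return False
        self.answered.append(qs)
        return True


def lower_bound(k, t, r):
    """S = (2k - (t + 1)) / r, the bound quoted in the text."""
    return (2 * k - (t + 1)) / r


monitor = OverlapMonitor(k=3, r=1)
decisions = [
    monitor.allow({1, 2, 3}),   # first query, answered
    monitor.allow({3, 4, 5}),   # overlap of 1 with the first, answered
    monitor.allow({1, 2, 4}),   # overlap of 2 with the first, refused
]
```

Every new query is compared against the whole history, so both the storage and the checking cost grow with the number of answered queries, and the third query is lost to users merely because the first one was asked earlier.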

Threat  monitoring  is  another  proposed  scheme  that  does 
not  guarantee  security  but  is  claimed  to  provide  a  deterrent 
for  intruders  [Hoffman,  1977;  Denning,  1978].  The  system 
monitors  queries  that  have  been  answered  and  tries  to  detect 
excessively  active  periods  of  use  of  a  database  and  to 
detect instances of many successive and similar queries.
This scheme, however, can easily be made ineffective by
masking queries [Schlorer, 1976].

Protection by Partitioning the Database:

In this protection scheme the whole database is
partitioned into groups of records, each of which lies in a
partition defined over k attribute domains [Yu and Chin,
1977]. Each partition has either no records or at least u
records with u > 1. Queries are modified to report over
partition boundaries. In other words, queries always involve
pre-specified groups of records and never subsets of these
groups. Thus records inside a group cannot be isolated by
overlapping queries and only information concerning whole
groups can be derived. However, if the database is dynamic,
i.e. insertions, deletions and updates of records are
allowed, then each change in a group can be detected and the
changed record can be disclosed. To illustrate various
compromises, consider the example in Figure 2.1 which
contains the customer accounts of a database and its
partitioned model. Assume a query is presented for the total
accounts of engineers with annual income > $38,000. This
query is modified so that it returns the total accounts of
the engineer customers with income > $40,000 (i.e. partition
p_1 in Figure 2.1). Since the records corresponding to
engineer customers with income > $40,000 form a partition
(p_1) and since there is only one customer record in that
partition, the information in that record can be extracted.

Record   Income    Profession   Account
r1       $10,000   Engineer     $167
r2       $15,000   Politician   $410
r3       $45,000   Lawyer       $24
r4       $20,000   Politician   $4210
r5       $8,000    Engineer     $325
r6       $30,000   Doctor       $12,000
r7       $25,000   Doctor       $500
r8       $50,000   Lawyer       $300
r9       $60,000   Lawyer       $123
r10      $90,000   Engineer     $20,000

(a) Records in the database

                            Engineer     Doctor       Lawyer            Politician
A.I. > $40,000              p1: r10      p2: -        p3: r3, r8, r9    p4: -
$20,000 < A.I. ≤ $40,000    p5: -        p6: r6, r7   p7: -             p8: -
A.I. ≤ $20,000              p9: r1, r5   p10: -       p11: -            p12: r2, r4

(b) Partitioned Model

Figure 2.1. Partitioned Database of Customer Accounts

Assume a new politician customer with annual income
$12,000 has opened an account. If a user knows that the
politician customer N has annual income ≤ $20,000 and that
N has recently opened an account in the bank, then he can
deduce the amount in N's account by querying the partition
p_12 before and after the insertion of N's record. Thus
changes in the database must be processed in a controlled
manner.
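
Both compromises can be reproduced with a small Python sketch built on the records of Figure 2.1. The income band boundaries follow the figure; the band labels, the function names and the $750 balance of the new customer are assumptions made for illustration (the text does not give the new account's amount).

```python
# (income, profession, account) for records r1..r10 of Figure 2.1.
records = [
    (10000, "Engineer", 167),
    (15000, "Politician", 410),
    (45000, "Lawyer", 24),
    (20000, "Politician", 4210),
    (8000, "Engineer", 325),
    (30000, "Doctor", 12000),
    (25000, "Doctor", 500),
    (50000, "Lawyer", 300),
    (60000, "Lawyer", 123),
    (90000, "Engineer", 20000),
]


def income_band(income):
    """Map an annual income to one of the three income ranges of the figure."""
    if income > 40000:
        return "high"
    if income > 20000:
        return "mid"
    return "low"


def partition_stats(db, band, profession):
    """SUM and COUNT of the accounts in one (income band, profession) partition."""
    accounts = [acct for inc, prof, acct in db
                if income_band(inc) == band and prof == profession]
    return sum(accounts), len(accounts)


# Compromise 1: the high-income engineer partition holds a single record.
singleton = partition_stats(records, "high", "Engineer")

# Compromise 2: query the low-income politician partition before and after
# the new customer's record is inserted (hypothetical balance of $750).
before = partition_stats(records, "low", "Politician")
after = partition_stats(records + [(12000, "Politician", 750)], "low", "Politician")
leaked_balance = after[0] - before[0]
```

The count rising from 2 to 3 confirms that exactly one record was inserted, and the difference of the two sums is that record's balance.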

The  partitioning  model  has  the  following  deficiencies. 

(1)  Modifying  queries  to  report  over  partition 
boundaries  may  conceal  too  much  information  unless  the 
partition  sizes  are  small,  uniform  and  independent  of  the 
distributions of records' attributes. However, in order to
avoid partitions with fewer than u records, variable size
partitions are proposed [Yu and Chin, 1977], and unless u is
small, this condition may create large partitions, reducing
the usefulness of the database. Also, checking and modifying
queries so that they report over partition boundaries, and
accessing nonuniform partition boundaries, may be costly.

(2)  If  the  database  is  currently  undergoing  changes 
(i.e.  insertions,  deletions  and  updates)  then  each  change 
has  the  danger  of  being  detected  and  the  values  of  those 
records  involved  may  be  disclosed.  In  [Yu  and  Chin,  1977], 
it  is  suggested  that  these  changes  be  classified  into  three 
different  groups  (insertions,  deletions  and  updates)  and  any 
changes in a partition should not be implemented until there
are t changes in the same group for that partition. This
policy introduces an error in SUM queries, which is
dependent on the values of the records with changes, and for
large t, this error may reduce the usefulness of the
statistical information. At the other extreme, however, if a
change is implemented immediately after it is requested, then
by querying before and after the change, the information in
the record involved with the change can be disclosed.

(3)  In  certain  cases,  users  may  already  know  some 
record  values  from  sources  other  than  the  database.  When 
this  happens,  some  mechanisms  are  needed  to  decide  which 
other  record  values  are  in  danger.  In  other  words,  exact 
information  revealed  to  users  must  be  recorded  and  kept  for 
auditing  purposes. 

(4)  Some  records  may  contain  more  sensitive  information 
than  the  others  and,  while  performing  changes  in  these 
records,  the  database  system  may  want  to  ensure  that  their 
disclosure  is  independent  of  the  disclosure  of  other  record 
values.  In  other  words,  the  database  system  may  want  to 
exercise  some  control  about  the  information  revealed  to 
users  during  changes  in  the  database. 

In  Chapter  3,  some  proposals  are  made  to  remedy  these 
deficiencies.

A variant of partitioning is grouping (or
microaggregation) which is used in off-line applications
such as census publications [Hansen, 1971; Fellegi and
Phillips, 1974]. Records are grouped together and only
aggregate statistical information is given. These two
techniques, partitioning and grouping, also have the
limitation of possible loss in the richness of the system,
especially when the groups are ill-formed.

Protection  by  Output  Perturbation: 

All  the  protection  methods  discussed  so  far  provide  the 
user  exact  statistical  information  of  the  query  set.  Another 
protection  scheme  is  to  perturb  the  responses  to  the  queries 
without  losing  too  much  of  the  meaningfulness  of  the 
information  in  the  database.  Rounding  [Hansen,  1971; 
Nargundkar  and  Saveland,  1972;  Achugbue  and  Chin,  1979]  is  a 
technique  commonly  used  in  off-line  cross-tabulations 
published  by  census  offices.  Instead  of  true  values,  rounded 
values  are  returned  to  the  user.  Introducing  randomization 
into  the  rounding  process  is  expected  to  enhance  security. 
However, since correct answers can be deduced by averaging
responses to repeated queries, this randomization should in fact be a
pseudo-random process, producing the same responses to the same
queries. Unfortunately, even with random-rounded tables,
compromise is still possible. Also, rounding methods assume
a static database with no user knowledge of protected
values; thus their effectiveness is limited.
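
The averaging attack and the repeatable alternative can be sketched in Python. The unbiased rounding rule (round up with probability remainder/base), the hash-derived seed and all names are assumptions chosen for illustration; they are one plausible way to realize the idea, not the procedure of the cited papers.

```python
import hashlib
import random


def random_round(value, base, rng):
    """Round a non-negative integer to a multiple of base, unbiased on average."""
    lower = (value // base) * base
    remainder = value - lower
    if remainder == 0:
        return lower
    return lower + base if rng.random() < remainder / base else lower


def repeatable_round(query_text, value, base):
    """Seed the generator from the query text so equal queries get equal answers."""
    digest = hashlib.sha256(query_text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return random_round(value, base, rng)


true_sum = 137  # hypothetical protected statistic

# Fresh randomness on every request: the average converges to the true value.
rng = random.Random(0)
answers = [random_round(true_sum, 5, rng) for _ in range(20000)]
averaged = sum(answers) / len(answers)

# Repeatable rounding: asking again reveals nothing new.
first = repeatable_round("SUM(SALARY) for programmers", true_sum, 5)
second = repeatable_round("SUM(SALARY) for programmers", true_sum, 5)
```

Each individual answer is either 135 or 140, yet the mean of many independent answers lies very close to 137; the repeatable variant removes that particular attack but not the other weaknesses noted above.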

One interesting study on SDB security allows the system
to "lie" [DeMillo et al., 1978]. The response to a query for the
median of a key-specified query set may be the value of any
arbitrary record in the query set. It is shown that
compromise is still possible with O(k ) queries, where k is
the fixed query set size.

Protection by Random Sampling:

Sampling the database is another technique which does
not always give true answers to queries [Hansen, 1971]. Only
a small sample of the entire database is used for answering
queries. The U.S. Census Bureau has used the principle of
random sampling of records with the sampling ratio 0.001.
Since the set of records is no longer selected by users, the
chance of compromise is small.

Denning  [Denning,  1979]  proposed  random  sampling  for 
on-line  databases  in  which  large  samples  of  query  sets  are 
used  for  answers.  Queries  for  frequencies  and  averages  are 
computed  using  random  samples  drawn  from  the  query  sets.  It 
is  shown  that  the  relative  error  in  the  statistics  decreases 
as the query set size increases, and the effort
required to compromise increases with the query set size due
to larger absolute errors. However, the database is assumed
static and users' supplementary knowledge is not considered.
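
The trade-off between relative and absolute error can be illustrated with a simple Bernoulli sampling sketch in Python. This is an assumption made for illustration and not necessarily the mechanism of [Denning, 1979]: each record of the query set is kept independently with probability p, and the COUNT is estimated as (records kept) / p.

```python
import random


def sampled_count(query_set, p, rng):
    """Estimate |query_set| from a sample that keeps each record with probability p."""
    kept = sum(1 for _ in query_set if rng.random() < p)
    return kept / p


rng = random.Random(1)
p = 0.9

small_set = list(range(50))
large_set = list(range(20000))

est_small = sampled_count(small_set, p, rng)
est_large = sampled_count(large_set, p, rng)

rel_error_large = abs(est_large - len(large_set)) / len(large_set)
```

Under this scheme the standard deviation of the estimate is sqrt(n(1-p)/p), which grows with the query set size n, while the relative standard deviation sqrt((1-p)/(pn)) shrinks; this matches the qualitative behaviour described above.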

Protection by Data Distortion:

In  [Dalenius,  1977],  the  notion  of  "statistical 
disclosure"  is  proposed:  a  disclosure  has  occurred  if,  using 
information  from  a  series  of  queries,  users  can  estimate  a 
database  value  more  closely  than  was  possible  without 
this  information.  Under  this  definition,  however,  disclosure 
cannot be prevented; it can only be controlled [Dalenius,
1977].

In one approach, records are stored together with a
permanent "perturbation factor", and responses to SUM
queries contain these perturbation factors [Beck, 1979]. The
definition of compromisability used in [Beck, 1979] states
that the SDB is statistically compromisable if it is
possible to estimate a data value y_i with an estimate ŷ_i such that

St-dev(ŷ_i) < c |y_i - Mean(y)|

where St-dev is the standard deviation, c is a constant and
Mean(y) is the mean value of all y_i's in the SDB. This
approach assumes a static database and does not consider
users' supplementary knowledge of protected data.
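
The criterion can be checked numerically with a short Python sketch. It assumes that the standard deviation of the estimator is approximated by the sample standard deviation of repeated estimates, which is a simplification introduced here; the function name and all numbers are hypothetical.

```python
from statistics import mean, stdev


def statistically_compromised(estimates, true_value, all_values, c):
    """True when St-dev(estimate) < c * |y_i - Mean(y)| for the given data."""
    return stdev(estimates) < c * abs(true_value - mean(all_values))


all_values = [20, 40, 60, 80, 100]   # hypothetical protected values y
target = 100                         # the value y_i under attack
c = 0.1

tight_estimates = [98, 102, 100, 101, 99]    # intruder's estimates cluster tightly
loose_estimates = [60, 140, 100, 20, 180]    # estimates are widely scattered

tight_result = statistically_compromised(tight_estimates, target, all_values, c)
loose_result = statistically_compromised(loose_estimates, target, all_values, c)
```

Here the threshold is c |y_i - Mean(y)| = 0.1 × 40 = 4, so the tightly clustered estimates count as a compromise and the scattered ones do not.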


CHAPTER  3 


PARTITIONING  MODELS  WITH  DATA  AND  OUTPUT  PERTURBATION 


Variations  of  the  partitioning  model  are  investigated 
to  remove  some  of  the  disadvantages  of  the  model.  Some 
simple  security  measures  are  defined  by  means  of  an 
undirected  graph  to  help  the  DBA  to  assess  the  security  of 
the  SDB  at  a  certain  time.  The  partitioning  models  of  this 
chapter  will  be  utilized  in  the  context  of  an  SDB  design  in 
Chapters  4  and  5. 

3.1  A  Partitioning  Model 


Consider a database in which every record $r_i$ has k
attribute domain values and one protected domain value, $v_i$;
each record belongs to some partition $p_j$, $1 \le j \le m$, which
has k dimensions defined over k attribute domains. The
database of customer accounts in Figure 2.1 of Section 2.2
is partitioned according to attribute domains income and
profession, i.e. partitions are 2-dimensional. Each query q
is modified to report over partition boundaries and the
system returns the sum, S(q), and count, C(q), of records in
partitions. In addition, changes in the database such as
insertions, deletions and updates are also allowed. Clearly
this model is a variation of the SDB model in Section 2.1 in
that a query set consists of records of a group of
partitions.


The  following  assumptions  are  made  about  the  database 
and  the  users. 

(a)  As  discussed  in  Section  2.2,  if  a  change  is 
implemented  as  soon  as  it  is  requested  then  there  is  a 
danger  of  disclosure.  Changes  in  a  partition  are  assumed  to 
be  processed  in  pairs. 


(b)  An  update  in  the  attribute  domain  values  of  a 
record  may  cause  the  record  to  move  from  one  partition  to 
another.  It  is  assumed  that  an  update  operation  can  be 
replaced  by  a  pair  of  insertion  and  deletion  operations. 


(c)  It  is  assumed  that  the  deleted  records  are  not 
normally  re-inserted  into  the  database,  or  if  they  are  re¬ 
inserted,  they  have  independent  protected  domain  values.  For 
example,  for  a  statistical  database  of  customer  accounts  in 
a  bank  it  is  assumed  that  when  a  data  person  closes  his 
account  and  re-opens  it  some  time  later,  the  amounts  in  his 
new  and  old  accounts  are  independent  of  each  other. 


(d)  The  user  is  assumed  to  know  the  properties  of  the 
database  such  as  what  the  partitions  are  and  how  the  system 
processes  queries  and  changes.  This  assumption  is  in  accord 
with  the  U.S.  Privacy  Act  [Privacy  Act,  1974].  It  is  also 
assumed  that  the  user  knows  in  which  partition  each  record 
belongs.

In what follows, an undirected graph called the
information graph is used to characterize the information
revealed to the users and some of the properties of the
information graph are presented. It is shown that if each
partition has an even number of records and no record value is
known initially then the database is secure.

3.1.1  An  Information  Graph 

For  any  partition  the  sequence  of  records  to  be 
inserted  and  deleted  forms  the  change  sequence.  Records  in 
the  change  sequence  are  called  dynamic  records,  otherwise 
they  are  called  static  records.  Since  record  changes  are 
processed  in  pairs,  we  form  tuples  of  two  records  when  they 
are  processed  together;  thus  the  change  sequence  is  grouped 
into  a  sequence  of  2-tuples.  Depending  upon  the  operation 
needed for each record $r_a$, $r_b$ in a tuple, there can be three
different tuples, namely, $(r_a^i, r_b^i)$ (insertion-insertion
tuple), $(r_a^d, r_b^d)$ (deletion-deletion tuple), and $(r_a^i, r_b^d)$
(insertion-deletion tuple). Since changes in a tuple are
processed  at  the  same  time,  a  tuple  is  unordered. 

Assume records $r_a$ and $r_b$ with values $v_a$ and $v_b$ are both
to be inserted into (i.e. the tuple $(r_a^i, r_b^i)$) or both to be
deleted from (i.e. the tuple $(r_a^d, r_b^d)$) the partition p.
Querying before and after the change, one can obtain the
equation $v_a + v_b = c_1$, where $c_1$ is a constant. Similarly, the
change $(r_c^i, r_d^d)$ gives the equation $v_c - v_d = c_2$ where $c_2$ is a
constant. The example below further illustrates derivable
equations and their equivalent form.

Example 3.1

Consider partition $p_j$ with the change sequence (in the
order of occurrence)

$r_1^i, r_2^i, r_3^i, r_1^d, r_4^i, r_5^i, r_6^d, r_7^i, r_7^d, r_8^i, r_8^d, r_9^d, r_3^d, r_{10}^d, r_{11}^d, r_{12}^i$

Tuples to be processed:

$(r_1^i, r_2^i)$, $(r_3^i, r_1^d)$, $(r_4^i, r_5^i)$, $(r_6^d, r_7^i)$,
$(r_7^d, r_8^i)$, $(r_8^d, r_9^d)$, $(r_3^d, r_{10}^d)$, $(r_{11}^d, r_{12}^i)$

Derivable equations:

$v_1 + v_2 = c_1$          $v_1 - v_3 = c_2$
$v_4 + v_5 = c_4$          $v_7 - v_8 = c_6$
$v_6 - v_7 = c_5$          $v_3 + v_{10} = c_3$
$v_{11} - v_{12} = c_8$        $v_8 + v_9 = c_7$

where $c_i$'s are constants.

Equivalent system of equations:

$v_1 + v_2 = c_1$          $v_2 + v_3 = c_9$
$v_4 + v_5 = c_4$          $v_7 + v_9 = c_{11}$
$v_6 + v_9 = c_{10}$         $v_3 + v_{10} = c_3$
$v_{11} - v_{12} = c_8$        $v_8 + v_9 = c_7$

where $c_i$'s are constants.


Thus users may derive equations involving either sums
or differences of two record values. Moreover, if there are
equations $v_a - v_b = c_1$ and $v_b + v_c = c_2$ then one can replace them by
the equivalent equations $v_a + v_c = c_3$ and $v_b + v_c = c_2$, where
$c_3 = c_1 + c_2$. As long as the equations with differences have
some record values in common with the sum equations, one can
repetitively change equations involving differences into
equations with sums. Eventually one will arrive at an
equivalent system of equations where the set of record
values $v_i$ involved in the sum equations is totally disjoint
from those in the equations with differences.

In  order  to  characterize  the  properties  of  the 
equivalent  system  of  equations  an  undirected  labeled  graph 
is employed. An undirected labeled graph $G_j = (V,E)$ is said to
be an information graph of partition $p_j$ if V, the set of
vertices, is the set of dynamic records in $p_j$; and E, the
set of s- or d-labeled edges representing the sum and
difference equations, is formed by considering each tuple in
the change sequence one by one and performing the following:

(a) if the tuple to be processed is $(r_a^i, r_b^i)$ or
$(r_a^d, r_b^d)$ then form the edge $(r_a, r_b)$ with label s unless
already formed. If the tuple to be processed is $(r_a^i, r_b^d)$
then form the edge $(r_a, r_b)$ with label d; and

(b) repetitively replace the d-edges with s-edges; for
example, if there are two edges, say $(r_a, r_b)$ and $(r_b, r_c)$
with labels d and s respectively, replace the edge $(r_a, r_b)$
with the s-labeled edge $(r_a, r_c)$.

Figure 3.1 contains the information graph of
partition $p_j$ in the above example. Notice that each
connected component in the graph $G_j$ in Figure 3.1 has either
s-labeled edges or d-labeled edges but not both.

Figure 3.1 Information Graph of Example 3.1.
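
The construction of the information graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_information_graph`, the encoding of a change as a `(record, op)` pair with `op` equal to `"i"` or `"d"`, and the rule of picking the first s-edge that shares a vertex with a d-edge are all choices made here, not part of the model's definition.

```python
def build_information_graph(tuples):
    """Build the labeled edge set from a sequence of change tuples.

    tuples: list of ((record, op), (record, op)), op in {"i", "d"}.
    Returns a dict mapping frozenset({a, b}) to the label "s" or "d".
    """
    edges = {}
    # Step (a): equal operations give a sum edge, mixed ones a difference edge.
    for (a, op_a), (b, op_b) in tuples:
        label = "s" if op_a == op_b else "d"
        edges.setdefault(frozenset((a, b)), label)

    # Step (b): replace a d-edge (a, b) by the s-edge (a, c) whenever an
    # s-edge (b, c) shares a vertex with it; repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for d_edge in [e for e, lab in edges.items() if lab == "d"]:
            partner = next(
                (e for e, lab in edges.items() if lab == "s" and e & d_edge),
                None,
            )
            if partner is None:
                continue
            (shared,) = d_edge & partner
            (a,) = d_edge - {shared}
            (c,) = partner - {shared}
            del edges[d_edge]
            edges.setdefault(frozenset((a, c)), "s")
            changed = True
    return edges


# Tuples of Example 3.1, records identified by their subscripts.
EXAMPLE_3_1 = [
    ((1, "i"), (2, "i")), ((3, "i"), (1, "d")),
    ((4, "i"), (5, "i")), ((6, "d"), (7, "i")),
    ((7, "d"), (8, "i")), ((8, "d"), (9, "d")),
    ((3, "d"), (10, "d")), ((11, "d"), (12, "i")),
]

graph = build_information_graph(EXAMPLE_3_1)
s_edges = {e for e, lab in graph.items() if lab == "s"}
d_edges = {e for e, lab in graph.items() if lab == "d"}
```

With this tie-breaking rule the s-edges correspond to the sum equations of the equivalent system in Example 3.1, and the only remaining d-edge joins records 11 and 12. A different choice of partner edge may give a different but equivalent system.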


3.1.2  Properties  of  the  Information  Graph  and  a  Security 
Result 

A  vertex  is  inactive  if  it  corresponds  to  a  deleted 
record in the partition, otherwise it is called active.
Notice  that  because  of  assumption  (c),  a  vertex  can  be 
inactivated  only  once.  Also  the  information  graph  can  be 
extended  only  from  its  active  vertices  by  inactivating  them. 
End  vertices  of  a  path  are  the  two  vertices  at  the  end  of  a 
path,  both  with  degree  one.  Below  the  properties  of  possible 
paths  are  specified. 

Property  1.  In  an  information  graph,  edges  with  label  d  form 
chains  and  all  the  vertices  but  one  of  the  end  vertices  in 
such  a  chain  are  inactive. 

Proof. Clearly s- and d-labeled edges cannot be in the same
path due to the repetitive deletion of d-labeled edges
performed in definition (b) of the information graph. Thus,
it suffices to show that d-labeled edges form chains with
only one active vertex.

A single isolated d-labeled edge can be formed only by
an insertion-deletion tuple $(r_a^i, r_b^d)$ where $r_b$ was in the
database when the database was formed and has not been
involved in any change sequence. Now the chain $r_b, r_a$ has
only one active vertex, namely $r_a$. It can be extended only
by another insertion-deletion tuple $(r_c^i, r_a^d)$, in which case
another d-labeled edge is added to form the isolated chain
$r_b, r_a, r_c$ with $r_c$ being the only active vertex. This
procedure can be repeated only by introducing another
d-labeled edge from the active end vertex. Thus only chains
can be formed. #

Property 2. In an information graph, (a) there can be at
most two active vertices in any path, and (b) if there
exists a path with two active end vertices, the length of
the path is always odd.

Proof. The proof is by induction on the number of tuples, n,
in the change sequence.

Basis. The property can be easily verified for n equals
1 and 2.

Induction Step. Assume that the induction hypothesis is
true for all change sequences of n tuples and that the
(n+1)st tuple involves $r_a$ and $r_b$.

(a) If the (n+1)st tuple is $(r_a^i, r_b^i)$, an isolated edge
$(r_a, r_b)$ with active vertices $r_a$ and $r_b$ is added to the
information graph and the property is obviously true.

(b) If the (n+1)st tuple is $(r_a^i, r_b^d)$, there are three
cases.

(i) If record $r_b$ has not been involved in any change
sequence previously (i.e. existed initially), an isolated
d-labeled edge $(r_a, r_b)$ with inactive $r_b$ is formed.

(ii) If vertex $r_b$ is the only active end vertex in a
chain $u_1$ with d-labeled edges (from property 1) then $r_b$ is
inactivated and $r_a$ becomes the only active vertex in the
isolated chain $u_1$.

(iii) If vertex $r_b$ is connected to a path $u_2 = r_m, \ldots,
r_j, r_b$ with an s-labeled edge $(r_j, r_b)$ then two new paths with
s-edges, $u_3 = r_m, \ldots, r_j, r_a$ and $u_4 = r_b, r_j, r_a$, are formed and $r_b$
is inactivated. Path $u_4$ has only one active vertex, namely
$r_a$. Thus the property holds for $u_4$. By the induction
hypothesis, if there is another active vertex in $u_2$, it must
be of odd length to $r_b$ and hence to $r_a$. Also by the
induction hypothesis, $u_2$ can have at most two active
vertices, one of which being $r_b$; thus $r_b$ is inactivated and
the new path $u_3$ can at most have two active vertices, one of
which being $r_a$.

(c) If the (n+1)st tuple is $(r_a^d, r_b^d)$, there are three
cases.

(i) If neither record $r_a$, $r_b$ has been involved in any
change sequence then an isolated s-labeled edge $(r_a, r_b)$ is
formed.

(ii) If only one of the records, say $r_a$, has not been
involved in any change sequence then the paths to which $r_b$
belongs lose their active vertex.

(iii) If both records $r_a$ and $r_b$ have been involved in the
change sequence previously then either they are the two
active vertices of an odd-length path u or they are the
active vertices of two disconnected paths $u_1$ and $u_2$. For the
former, u will become an even-length closed path with all
inactive vertices. For the latter, (1) if $u_1$ and $u_2$ are
d-labeled chains then from property 1 the resultant connected
chain has no active vertices; (2) if one of $u_1$ or $u_2$
initially has two active vertices and the other has a single
active vertex then the new path formed by joining $u_1$ and $u_2$
will have a single active vertex; (3) if $u_1$ has another
active vertex, say $r_i$, and $u_2$ has another active vertex, say
$r_j$, then by the induction hypothesis, $r_i$ is at an odd length to
$r_a$ and $r_j$ is at an odd length to $r_b$, thus the new path
$r_i, \ldots, r_a, r_b, \ldots, r_j$ is of odd length and has exactly two
active vertices, $r_i$ and $r_j$. Thus property 2 holds for
this case.

Since (a), (b) and (c) cover all possibilities, the
property holds for the (n+1)st tuple. #


Property  3.  Cycles  in  an  information  graph  are  even-length 
and  contain  only  inactive  vertices  and  s-labeled  edges. 

Proof. From property 1 all d-labeled chains contain only one
active vertex and cannot form a cycle. Thus cycles can only
be formed from s-edges by inactivating two active vertices
in a path. Assume $r_a$ and $r_b$ are two active vertices and u is
the odd-length path of s-edges containing them (from
property 2). Now when the records corresponding to vertices
$r_a$ and $r_b$ are deleted together u becomes an even-length
cycle containing only inactive vertices and s-labeled edges.
#


Below it is stated that if the users' supplementary
knowledge does not include any protected domain value $v_i$,
and if each partition initially has an even number of records,
then the partitioning model is secure.

Theorem. If no protected domain value is known initially,
the partitioned database described in Table 3.1 is secure.

Proof. The information graph of partition $p_j$ contains all the
equations about the dynamic records of $p_j$ that are revealed
to the user. From property 3, it is known that all the
cycles in the information graph representing the system of
equations are even. Such a graph is well-known to be
2-colorable [Harary, 1969]. Therefore, one can construct an
infinite number of solutions for this system of equations by
adding an arbitrary number to all the records which have the
same color and subtracting the same amount from all the
records with the other color. Thus protected domain values of
dynamic records cannot be disclosed. Consider the static
records of partition $p_j$; since there is always an even
number of records in $p_j$ (zero or at least two), protected domain
values of static records cannot be disclosed and the
database is secure. #
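
The 2-coloring argument in the proof can be illustrated with a short Python sketch. It assumes a component that contains only s-labeled edges, given as pairs of record identifiers; the function names `two_color` and `shifted_solution` and the sample values are hypothetical and serve only as illustration.

```python
from collections import deque


def two_color(vertices, s_edges):
    """Return a 0/1 coloring in which the ends of every edge differ.

    Raises ValueError if the graph contains an odd cycle.
    """
    adj = {v: [] for v in vertices}
    for a, b in s_edges:
        adj[a].append(b)
        adj[b].append(a)
    color = {}
    for start in vertices:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    raise ValueError("odd cycle: graph is not 2-colorable")
    return color


def shifted_solution(values, color, t):
    """Add t to one color class and subtract t from the other."""
    return {v: x + t if color[v] == 0 else x - t for v, x in values.items()}


# An even cycle of sum equations on four records (illustrative values).
cycle_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
true_values = {1: 10, 2: 20, 3: 30, 4: 40}
coloring = two_color(true_values.keys(), cycle_edges)
alternative = shifted_solution(true_values, coloring, 7)
```

Every edge sum is the same for `true_values` and `alternative`, so a user who sees only the sums cannot tell the two assignments apart; any other shift `t` gives yet another consistent assignment.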

Table  3.1  The  Partitioning  Model 

(1)  The  database  is  divided  into  disjoint  partitions  and 
each  partition  either  contains  an  even  number  of  records  or 
is  empty. 

(2)  Changes  in  the  database  are  processed  in  pairs. 

(3)  Each  query  is  modified  to  report  over  partition 
boundaries  and  the  system  returns  the  total  number  of 
records  in  the  partitions  covered  by  the  modified  query  q 
and  the  sum  of  their  protected  domain  values. 

Note that this security result even holds with some
supplementary knowledge of users. For example, the user may
know the protected domain values of all static records in $p_j$
except two records $r_a$ and $r_b$; then all the protected domain
values of records in $p_j$ are still secure in the sense that
the user cannot extend his supplementary knowledge. If the
users' supplementary knowledge includes the protected domain
values of some dynamic and some static records in partition
$p_j$, then all the other dynamic records which have a path to
the known dynamic records in the information graph can be
deduced. Moreover, if there is only one static record $r_c$
whose protected domain value is unknown among the static
records in partition $p_j$ and if, during the process of the
change sequence, partition $p_j$ contains only $r_c$ and the set
of disclosed records, then the protected domain value of $r_c$
may be deduced.

3.2 Partitioning Models with Data and Output Perturbation


In  Section  3.1,  an  SDB  model  is  presented  with  the 
following  assumptions. 

(1) There is an even number of records in each partition.

(2)  Changes  in  partition  p^  must  wait  for  some  time 
until  the  next  change  in  p^. 

(3)  Users  do  not  have  any  supplementary  knowledge  of 
protected  property  values. 


When  the  above  assumptions  are  relaxed,  the  model 
becomes  insecure.  Assumption  (1)  introduces  implementation 
difficulties  and  variable-size  partitions.  Assumption  (2) 
introduces  an  error  which  is  dependent  on  the  protected 
domain  values  of  dynamic  records  waiting  to  be  processed. 
Finally,  assumption  (3)  may  not  always  hold  and  may  lead  to 
a compromise. In this section, several variations of the
partitioning model, which utilize data and/or output
perturbation, are proposed and their effectiveness in
preventing compromise is analyzed.


3.2.1  A  Partitioning  Model  with  Data  Perturbation 


Assumptions (1) and (2) can be relaxed by introducing
dummy records for each nonempty partition where the value $x_j$
of a dummy record $dr_j$ is a random variable with zero mean
and a small variance. Thus an answer to a query may contain
an error due to dummy records, but this error is
pre-controlled by adjusting the mean and variance of the random
variable whereas in the model in Section 3.1, the error is
dependent on the protected domain values of dynamic records
waiting to be processed and is uncontrollable. Dummy records
may be implemented as follows: if initially partition $p_j$
contains an odd number of records, a dummy record is inserted
into the partition making the number of records in $p_j$ even.
Assume the following sequence of changes is to be made in
partition $p_j$ which initially has an even number of records:
$r_a, r_b, r_c, r_d, \ldots$. This sequence can be changed by
adding and deleting dummy records as follows: $(r_a, dr_1)$,
$(r_b, dr_2)$, $(r_c, dr_3)$, $(r_d, dr_4)$, ..., where $dr_j$,
j=1,2,..., are dummy records.


The distribution of the random variable of dummy
record $dr_j$ must have certain properties. Its mean should be
zero and different partitions should have independent random
variables so that $E(\sum_{j=1}^{n} x_j) = 0$ (E is the expected value
and $x_j$, j=1,2,...,n, are identically distributed,
independent random variables of dummy records $dr_j$). This
property is needed since for a query involving several
partitions the expected value of the error introduced due to
dummy records should be zero. A normal distribution with
zero mean may be a good choice for $x_j$.
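
As a rough illustration of the zero-mean requirement, the following Python sketch simulates the error that dummy records add to a SUM query covering several partitions. It assumes one dummy record per partition with a normally distributed value; the function names, the number of partitions, the standard deviation and the seed are all arbitrary choices made for the illustration.

```python
import random


def query_error(num_partitions, sigma, rng):
    """Total error added to one SUM query by one dummy record per partition."""
    return sum(rng.gauss(0.0, sigma) for _ in range(num_partitions))


def mean_query_error(num_partitions, sigma, trials, seed=0):
    """Average error over many simulated queries (fixed seed for repeatability)."""
    rng = random.Random(seed)
    total = sum(query_error(num_partitions, sigma, rng) for _ in range(trials))
    return total / trials


# Five partitions, dummy values drawn from N(0, 1), 20000 simulated queries.
average_error = mean_query_error(num_partitions=5, sigma=1.0, trials=20000)
```

The average stays close to zero, while an individual query can still be off by a few multiples of `sigma`, which is the trade-off between accuracy and protection that the standard deviation controls.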


The standard deviation of $x_j$ is particularly important
and dependent on factors like the properties of the
database, the required level of security, the accuracy of
the statistical information to be revealed to the users, and
other requirements of the database system. It is the
database administrator's responsibility to decide about the
most suitable value for the standard deviation. For example,
it should be a function of (1) the sum of the protected values,
$v_i$, in $p_j$ (e.g. upper bounded by 0.05 of $\sum v_i$, $v_i$ in $p_j$),
(2) the distribution of the $v_i$'s in the partition (e.g. the
standard deviation of $x_j$ = the standard deviation of $v_i$), and
(3) the protected domain value of the dynamic record with
which it forms a tuple in the change sequence (e.g. assume
$r_a$ and $dr_j$ are to form a tuple; if $v_a$ (= annual
income) = \$200,000 then the standard deviation of $x_j$ = \$50,000,
or if $v_a$ (= annual income) = \$10,000 then the standard deviation
of $x_j$ = \$3,000).


3.2.2  A  Partitioning  Model  with  Rounding 


Consider  the  partitioning  model  in  Section  3.1.  Assume 
SUM  query  responses  for  each  partition  are  rounded  using  a 
rounding  base  b  and  let  m  be  the  true  answer  to  the  query  q. 
Then  the  rounded  response,  S(q),  is  defined  as 

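$$S(q) = \begin{cases} m & \text{if } r = 0 \\ m - r & \text{if } r \le \lfloor (b-1)/2 \rfloor \\ m + b - r & \text{if } r > \lfloor (b-1)/2 \rfloor \end{cases}$$

where $r = m - \lfloor m/b \rfloor \cdot b$ and b is odd. Clearly,
$m \in [S(q)-(b-1)/2,\ S(q)+(b-1)/2]$.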
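
A minimal Python sketch of this rounding rule follows. It assumes integer sums and an odd rounding base, and it rounds down when the remainder is at most (b-1)/2 and up otherwise, which is the reading under which the stated bound on m holds; the function name `rounded_sum` is hypothetical.

```python
def rounded_sum(m, b):
    """Round the true sum m to the nearest multiple of the odd base b."""
    if b % 2 == 0:
        raise ValueError("the rounding base b is assumed to be odd")
    r = m - (m // b) * b          # remainder of m modulo b, in 0..b-1
    if r <= (b - 1) // 2:
        return m - r              # round down (covers r == 0 as well)
    return m + b - r              # round up


# With b = 5: 12 rounds down to 10 and 13 rounds up to 15.
examples = [rounded_sum(m, 5) for m in (10, 12, 13)]
```

For every integer m the response differs from m by at most (b-1)/2, so the true sum is only known to lie in an interval of b consecutive integers around the response.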


Now consider partition p with response S to SUM
queries. Assume S' becomes the response to the user after
records $r_1$ and $r_2$ with property values $v_1$ and $v_2$ are added
to p. Clearly one has

$$v_1 + v_2 \in [(S'-S)-(b-1),\ (S'-S)+(b-1)] = [kb-(b-1),\ kb+(b-1)]$$

Later on, if $r_1$ and $r_2$ are deleted together from p one may
have $v_1 + v_2 \in [k'b-(b-1),\ k'b+(b-1)]$.

If $k \ne k'$, one can deduce a range for $v_1+v_2$ of size (b-2).
Similarly, if a deletion and an insertion are processed
together, one can obtain a range of size 2(b-1) for $v_2-v_1$ or
$v_1-v_2$.
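
The narrowing of the range when two observations give different multiples of b can be checked with a small Python sketch. The helper names `sum_range` and `intersect` are hypothetical, and the values b = 5, k = 2 and k' = 3 are chosen only for illustration.

```python
def sum_range(k, b):
    """Range [kb-(b-1), kb+(b-1)] inferred from one pair of rounded responses."""
    return (k * b - (b - 1), k * b + (b - 1))


def intersect(r1, r2):
    """Intersection of two closed integer ranges, or None if they are disjoint."""
    lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (lo, hi) if lo <= hi else None


b = 5
first = sum_range(2, b)            # observed when the records are inserted
second = sum_range(3, b)           # observed when they are deleted, k' = k + 1
narrowed = intersect(first, second)
size = narrowed[1] - narrowed[0]   # equals b - 2
```

Here the two ranges are [6,14] and [11,19], so the user can conclude that the sum lies in [11,14], a range of size b-2 = 3 instead of 2(b-1) = 8.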

Let us now investigate the possibility of a compromise
if a user knows only one protected property value, say $v_1$ of
record $r_1$, in the partition p. Define the graph G'=(V,E) of
p where the vertex set V is the set of records in the change
sequence of p, and $(r_i, r_j)$ is an s (d) -labeled edge in E
iff $r_i$ and $r_j$ form an insertion-insertion or deletion-deletion
(insertion-deletion or deletion-insertion) tuple in the
change sequence. It is easy to see that $r_1$ in a cycle in G'
is a necessary condition for compromising the SDB as shown
in the example below.


Example 3.2

Let q denote the query for p only. $v_1 = 6$ is known, b=5 and
originally S(q)=13.

Change sequence: $r_1^i, r_2^i, r_1^d, r_3^i, r_5^i, r_6^i, r_3^d, r_4^i, r_2^d, r_4^d$

[Figure: graph G' of the change sequence.]

Processing the change sequence leads to the range inferences

$v_1+v_2 \in [6,14]$, $v_3-v_1 \in [1,9]$, $v_4-v_3 \in [1,9]$, $v_2+v_4 \in [16,24]$.

Knowledge of $v_1=6$ leads to $v_2 \in [0,8]$, $v_3 \in [7,15]$, $v_4 \in [8,24]$
and $v_2 \in [8,32]$ in the given order. Thus $v_2$ is disclosed. #

Thus from the above example one can see that knowledge of a
single property value may lead to other disclosures. However,
the probability of its occurrence is very small. Consider G'
in the above example. For each edge in the cycle one can
deduce the range [MIN(x), MAX(x)] for $x = v_i + v_j$ or $x = v_i - v_j$
depending on whether it is an s or d edge. In order to have
disclosure in the above situation, one must have the
condition that (i) a portion of consecutive edges in the
cycle have MAX(x)-x=0 and (ii) other edges in the cycle have
x-MIN(x)=0.

Assume initially the sum of protected property values
in the partition p is S. Two records with values $v_a$ and $v_b$
are added to p. Let $x = v_a + v_b$. We would like to find a range
for x as [MIN(x), MAX(x)]. Assuming $S_b = (S) \bmod b$ and
$x_b = (x) \bmod b$ are equally likely anywhere in [0, b-1] (b is
the rounding base) and independent from each other, we have,
for $x_b = i$, $0 \le i \le b-1$,

$$[\text{MIN}(x), \text{MAX}(x)] = \begin{cases} [(k-1)b+1,\ (k+1)b-1] & \text{with prob. } (b-i)/b \\ [kb+1,\ (k+2)b-1] & \text{with prob. } i/b \end{cases}$$

Thus for $x_b = i$, $0 \le i \le b-1$,

$$\text{MAX}(x) - x = \begin{cases} (b-1)-i & \text{with prob. } (b-i)/b \\ 2b-1-i & \text{with prob. } i/b \end{cases}$$

giving

$$E(\text{MAX}(x) - x) = b-1$$

Similarly,

$$E(x - \text{MIN}(x)) = b-1$$

Also $\text{Prob}(\text{MAX}(x)-x=0) = \text{Prob}(x-\text{MIN}(x)=0) = 1/b^2$. For b=10, this
probability is 0.01 and notice that any cycle in G' consists
of at least 4 edges. Thus probability of compromise is
$\ll 1/b$ (with the assumptions of equally likely and
independent x and S values).
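
The expectation and the probability above can be verified by exact enumeration. The Python sketch below takes the two-case distribution of MAX(x)-x as given in the text and uses rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def max_gap_distribution(i, b):
    """Values of MAX(x)-x and their probabilities, given x mod b = i."""
    return [((b - 1) - i, Fraction(b - i, b)), (2 * b - 1 - i, Fraction(i, b))]


def expected_max_gap(i, b):
    """E(MAX(x)-x) conditional on x mod b = i."""
    return sum(gap * p for gap, p in max_gap_distribution(i, b))


def prob_zero_gap(b):
    """Prob(MAX(x)-x = 0) when x mod b is uniform on 0..b-1."""
    total = Fraction(0)
    for i in range(b):
        for gap, p in max_gap_distribution(i, b):
            if gap == 0:
                total += Fraction(1, b) * p
    return total


b = 10
expectations = [expected_max_gap(i, b) for i in range(b)]
p_zero = prob_zero_gap(b)
```

For every residue i the conditional expectation equals b-1, and the gap is zero only when i = b-1 and the first case applies, which gives the probability 1/b^2 (0.01 for b = 10).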


Disclosure due to a single property value knowledge can
be prevented with a simple strategy using the notion of
active vertex in Section 3.1. For each path with two active
vertices in G', keep adding MAX(x)-x values into $y_1$ and
x-MIN(x) values into $y_2$. Clearly, any single property value
knowledge of an intruder may decrease the ranges of other
property values, $v_i$, in the path to at most $y_1+y_2$ and does
so only when that path turns into a cycle. Thus the database
system may prevent the formation of a cycle using dummy
records if $y_1+y_2=0$.

3.2.3  A  Partitioning  Model  with  Rounding  and  Data 
Perturbation 

Consider the partitioning model in Section 3.2.1, i.e.,
insertions and deletions are processed together with dummy
records. SUM query response, S(q), which covers partitions
$p_i$, $1 \le i \le j$, is defined as follows. $S(q) = \sum_{i=1}^{j} S'_i$, where
$S'_i$ is the rounded sum of property values of records and the
dummy record in $p_i$.

Clearly this model removes assumptions (1) and (2) and
is better than the partitioning models in the previous two
sections (3.2.1 and 3.2.2). Below it is shown that under
normal circumstances, the expected error in this model is
zero.


Assuming N records equally likely anywhere in the
database and n partitions, the probability that the number
of records, Z, in partition p is k is given by

$$\text{Prob}(Z=k) = p_Z(k) = \binom{N}{k} \frac{(n-1)^{N-k}}{n^N}, \quad k = 0, 1, \ldots, N$$

where $p_Z$ is the probability function for Z.

Thus Z has a binomial distribution, and its mean and
variance are

$$M_Z = N/n \qquad (1)$$

$$\text{Var}(Z) = N(1/n)(1-1/n)$$


Let $V_1, V_2, \ldots, V_Z$ be a sequence of independent and
identically distributed random variables (r.v.) representing
the property values of records in p, and the mean and
variance of these r.v.s be denoted by $M_V$ and Var(V). Let T
be a r.v. for the dummy record in p with mean $M_T = 0$ and
variance Var(T). The sum of property values of records in p
is defined as $S = \sum_{i=1}^{Z} V_i + T$. Assuming Z is independent of $V_i$'s
and T, we have

$$E(S) = E(E(S|Z))$$

and

$$E(S|Z=k) = kE(V) + M_T = kM_V$$

giving

$$E(S|Z) = ZM_V$$

and

$$E(S) = E(ZM_V) = M_Z \cdot M_V \qquad (2)$$

Also

$$\text{Var}(S) = E(\text{Var}(S|Z)) + \text{Var}(E(S|Z))$$

but

$$\text{Var}(S|Z=k) = k\,\text{Var}(V) + \text{Var}(T)$$

$$\text{Var}(S|Z) = Z\,\text{Var}(V) + \text{Var}(T)$$

Hence

$$\text{Var}(S) = E(Z\,\text{Var}(V) + \text{Var}(T)) + \text{Var}(ZM_V)$$

$$= \text{Var}(T) + (N/n) \cdot \text{Var}(V) + N(1/n)(1-1/n) \cdot M_V^2 \qquad (3)$$

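
Equations (2) and (3) can be checked numerically on a small case by exact enumeration. The Python sketch below assumes, purely for illustration, N = 3 records, n = 2 partitions, record values uniform on {1, 3} and a dummy value uniform on {-1, 1}; the function name `exact_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def exact_moments(N, n, v_support, t_support):
    """Exact mean and variance of S = V_1 + ... + V_Z + T.

    Z is binomial(N, 1/n); the V_i and T are uniform on their supports.
    """
    p = Fraction(1, n)
    mean = Fraction(0)
    second = Fraction(0)
    for k in range(N + 1):
        p_z = comb(N, k) * p**k * (1 - p) ** (N - k)
        for vs in product(v_support, repeat=k):
            for t in t_support:
                prob = (p_z
                        * Fraction(1, len(v_support)) ** k
                        * Fraction(1, len(t_support)))
                s = sum(vs) + t
                mean += prob * s
                second += prob * s * s
    return mean, second - mean**2


N, n = 3, 2
mean_S, var_S = exact_moments(N, n, v_support=(1, 3), t_support=(-1, 1))

# Right-hand sides of (2) and (3) with M_V = 2, Var(V) = 1, Var(T) = 1.
M_V, var_V, var_T = 2, 1, 1
formula_mean = Fraction(N, n) * M_V
formula_var = var_T + Fraction(N, n) * var_V + N * Fraction(1, n) * (1 - Fraction(1, n)) * M_V**2
```

In this small case both sides agree: the mean is 3 and the variance is 11/2.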

Let $R = (S) \bmod b$. Thus the response to a SUM query for
partition p will be (S+b-R) if R > (b-1)/2 or (S-R) if
$R \le (b-1)/2$. Then the error W will be

$$W = \begin{cases} T + b - R & \text{if } R > (b-1)/2 \\ T - R & \text{otherwise} \end{cases}$$

If $V_i$'s and T are symmetrically distributed about $M_V$ and $M_T$,
respectively, we have

Lemma. If $M_Z \cdot M_V = kb$, $k \in$ Integers, b is odd, then

(a) $M_W = 0$

and (b) Var(W) = Var(T)

Proof. See Appendix A #

Thus if the database system has control over the size
of the partitions (which may not be possible) and $V_i$'s are
symmetrically distributed about $M_V$, $M_Z$ (given by (1)) can be
adjusted to obtain $M_W = 0$, i.e., the expected error introduced
in a query is zero.


Notice that for a partition with a relatively large
number of records, using the central limit theorem [Mood et
al., 1974], S is normally distributed (and hence symmetrical
about $M_S$) even if $V_i$'s are not symmetrically distributed.
Thus the assumption of symmetrically distributed $V_i$'s is
not needed for partitions with a sufficiently large number of
records. Moreover, since the variance of error is equal to
Var(T), (i) it can be adjusted along the line of suggestions
in Section 3.2.1, and (ii) Var(T) may be increased for paths
with $y_1 + y_2 = 0$ (described in Section 3.2.2) in order to
have better protection.

3.3  Security  Measures  Related  to  the  Information  Graph 

The  information  graph  in  Section  3.1  has  another  usage. 
Clearly,  users'  knowledge  of  one  property  value  of  an 
individual  r  in  the  model  in  Section  3.1  is  sufficient  to 
disclose  all  other  property  values  of  individuals  that  are 
in  the  same  connected  component  with  r  in  the  related 
information  graph.  Thus,  a  measure  of  security  may  be 
defined in terms of the number of connected components of the
information graph. The reachability set Rs is defined as a
subset of the individuals in partition p such that they are
in the same connected component. We also define

Reachability Constant $w_1$ = (no. of Rs in the inf. graph) / (no. of vertices in the inf. graph)

Clearly $0 < w_1 \le 1$, and $w_1 = 1$ implies relatively "more"
security (more connected components in the information
graph) and $w_1$ close to 0 implies relatively "less" security.

Another security measure may be the largest reachable set size
$w_2$, i.e. $w_2 = \text{MAX}\,|Rs|$ for all reachability sets Rs. Thus
relatively small $w_2$ and $w_1 = 1$ imply relatively "more"
security. For example, if the user's knowledge includes x
dynamic records of partition p, he can at most increase his
knowledge by $(w_2-1)x$ more protected domain values.
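
Both measures are straightforward to compute from the edge list of an information graph. The Python sketch below treats reachability sets as connected components and uses the equivalent system of Example 3.1 as input; the function names are hypothetical and the edge labels are ignored because only connectivity matters here.

```python
from fractions import Fraction


def reachability_sets(vertices, edges):
    """Connected components of the graph, returned as a list of sets."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adj[u] - component)
        seen |= component
        components.append(component)
    return components


def security_measures(vertices, edges):
    """Return (w1, w2): reachability constant and largest reachable set size."""
    sets = reachability_sets(vertices, edges)
    return Fraction(len(sets), len(vertices)), max(len(s) for s in sets)


# Edges of the equivalent system in Example 3.1 (records 1..12).
records = list(range(1, 13))
example_edges = [(1, 2), (2, 3), (3, 10), (4, 5),
                 (6, 9), (7, 9), (8, 9), (11, 12)]
w1, w2 = security_measures(records, example_edges)
```

For this graph there are four reachability sets over twelve vertices, so w1 = 1/3 and w2 = 4: a user who knows one value in the largest set could learn up to three more.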

The two described measures of security, $w_1$ and $w_2$, are
not controllable in the model described in Section 3.1. For
some databases, dummy records may be used to control and
change the sizes of the reachability sets of records and,
thus, control the security measures $w_1$ and $w_2$. In Appendix
B, some ways of applying dummy records to control those
paths with too many vertices are briefly discussed.


CHAPTER  4 


STATISTICAL  DATABASE  DESIGN 

The  SDB  Security  problem  is  investigated  at  the 
conceptual  model  level.  Three  different  data  models  are 
analyzed  for  their  suitability  as  a  conceptual  model  of  SDB. 
Using  a  formal  framework,  the  design  of  an  SDB  is 
investigated. Possible types of inferences are classified, and
security constraints are defined and enforced.
Implementation  issues  of  the  design  are  discussed. 

4.1  Introduction 

Below  some  shortcomings  of  previous  studies  on  SDB 
security  are  discussed. 

1)  Statistical  databases  provide  statistical 
information  about  groups  of  individuals  in  the  real  world. 
The  assumption  is  that  statistical  information  about  a  group 
of  individuals  conveys  a  meaningful  aspect  of  that  group  of 
individuals.  However,  statistical  information  about  an 
arbitrarily  chosen  group  of  individuals  may  not  have  a 
useful meaning attached to it. Previous SDB security studies
have not dealt with the question of whether or not
statistical information is "meaningful", and some have set
forth questions (and given answers) with assumptions like
"every possible combination of records can be requested" or
"all possible medians of any sets of records are queriable"
[Dobkin et al., 1977; Dobkin et al., 1979; Reiss, 1979].

These types of assumptions cause an explosion in the
complexity of the problem, and consequently, the protection
measures highly limit the richness of the SDB (see Chapter
3). However, once a proper definition of the "statistical
information" is used, and an analysis of the portion of the
real world represented by an SDB is made for determining its
statistical information, these combinatorially explosive
possibilities perhaps can be reduced or even eliminated.


2) The SDB models used in previous studies (see
Chapter 3) used terms like records, record fields, etc.
Databases  are  more  than  collections  of  records,  and  the 
information  in  databases  may  be  highly  complex.  Databases 
contain  a  model  of  some  portion  of  the  real  world; 
effectiveness  of  protection  measures,  if  treated  at  that 
level,  may  increase.  In  all  previous  studies,  the  SDB  models 
used  were  closer  to  the  physical  level,  rather  than  the 
conceptual  level,  of  the  database.  Thus  they  encountered  the 
problem  of  security  at  a  very  low  level,  that  of  physical 
records.  Although  these  studies  have  contributed  to  our 
understanding  of  the  problem,  their  SDB  models  were 
incomplete,  and  the  results  were  usually  negative  in  tone. 




3)  All  previous  studies  (except  [Yu  and  Chin,  1977]) 
considered  static  databases  in  order  to  simplify  the 
problem.  Changes  may  occur  not  only  at  the  level  of 
insertions,  deletions  or  updates  of  individuals  (i.e., 
primitive  changes  as  discussed  in  Chapter  3),  but  at  the 
level  of  the  conceptual  model  (i.e.,  high-level  changes  such 
as  different  views,  abstractions,  etc.).  The  problem  of  SDB 
security  should  also  be  investigated  for  dynamic  databases 
to  capture  the  dynamics  of  the  real  world. 

4)  In  the  real  world,  users'  additional  (supplementary) 
knowledge  may  take  the  form  of  general  rules,  relationships, 
or  simply,  protected  property  values.  If  the  database 
administrator  (DBA)  is  aware  of  this  information,  effective 
security  measures  can  be  imposed  easily.  The  example  below 
illustrates  this  situation. 

Example  4.1  Consider  a  database  of  employees  of  a  certain 
computer  manufacturing  company  in  which  the  sum  of  salaries 
of  employees  is  queriable.  Assume  the  following  information 
(which  is  not  represented  in  the  database  and  hence  unknown 
to  the  database  system)  exists. 

(a)  Salary  range  of  a  new  systems  analyst  with  B.S.  is 
$[10K, 12K] . 

(b)  Salary  range  of  a  new  systems  analyst  with  M.S.  is 
$[12K, 14K] . 

Now assume two new systems analysts are hired and
information about them is inserted into the database. If the
change in the sum of salaries of systems analysts is $27K,
then users can conclude that both new employees have M.S.
degrees, since any pair that includes a B.S. holder would
raise the sum by at most $26K.
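
The inference in Example 4.1 can be replayed mechanically. The sketch below is only an illustration under the two salary ranges stated in the example; the function name and the data layout are assumptions of this sketch.

```python
from itertools import combinations_with_replacement

# Salary ranges (in $K) known to users but not to the database system.
RANGES = {"B.S.": (10, 12), "M.S.": (12, 14)}

def consistent_degree_combinations(total_change, hires):
    """List the degree combinations whose salary ranges can sum to total_change."""
    feasible = []
    for combo in combinations_with_replacement(sorted(RANGES), hires):
        low = sum(RANGES[d][0] for d in combo)
        high = sum(RANGES[d][1] for d in combo)
        if low <= total_change <= high:
            feasible.append(combo)
    return feasible

# A $27K change after two hires leaves a single consistent explanation.
result = consistent_degree_combinations(27, hires=2)
```

With a change of $27K only the pair (M.S., M.S.) remains, whereas a change of $24K would be consistent with all three combinations and would disclose nothing about the degrees.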

Most  problems  in  SDB  security  can  be  removed  by  a  good 
model  of  the  real  world  environment  so  that  the  DBA  can  take 
effective  protection  measures.  Thus  existing  relationships 
and  semantics  of  the  information  should  always  be  considered 
for  an  effective  SDB  design. 

5) When users' additional knowledge increases, some
mechanisms are needed for the DBA to decide (i) what
other information has been disclosed by users, and (ii) what
protective measures should be taken. In other words, exact
information revealed to users should be kept in some compact
form for auditing purposes. Some previous studies proposed
investigation of log trails for auditing [Hoffman and
Miller, 1970; Dobkin et al., 1977; Hoffman, 1977]. However,
for very large databases, the enormous amount of information
in log trails is of little help for checking security (not
to mention the "masking" of queries by users [Denning et
al., 1979; Schlorer, 1976]).

4.2  Statistical  Database  Design 

This  section  gives  formal  definitions  of  the  SDB  and 
the  security  problem  (section  4.2.2)  and  discusses  the 
design of an SDB which employs security constraints at the
conceptual data model level. Below are the desirable
features  of  the  SDB  design  in  terms  of  the  "goodness" 
criteria  introduced  in  Section  2.2. 

(a)  Effectiveness  of  the  protection.  In  order  for  the 
SDB  system  to  be  effective,  the  database  should  be  equipped 
with  the  following  information. 

1)  A  "good"  conceptual  model.  As  a  response  to 
problem  2  in  Section  4.1,  the  SDB  security  should  be 
elevated  to  the  conceptual  model  level. 

2)  Well-defined  statistical  information. 

Statistical  information  must  be  well-defined  and  an  analysis 
of  the  specific  information  and  its  statistical  constituents 
should  be  made.  This  will  help  to  reduce  the  size  of  the 
security  problem  (crystallize  the  complex  relationships, 
define  the  information  to  be  secured,  etc.)  and  thus 
eliminate  problem  1  mentioned  before. 

For  the  real  world  model,  the  statistical  information 
revealed  to  users  will  only  be  about  pre-defined  groups  of 
individuals.  The  intersection  of  these  groups  of  individuals 
will  give  a  set  of  indivisible  groups  of  individuals  and  any 
statistical  information  about  these  indivisible  groups  of 
individuals  will  constitute  atomic  information.  Thus  the 
database  system  is  no  longer  interested  in  giving  out 
uncontrolled,  random  statistical  information  to  users,  which 
may easily be exploited, but rather it will give out
well-defined information that can at most be reduced to atomic
information.
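
One way to picture the indivisible groups is to partition individuals by the exact set of pre-defined groups they belong to. The following sketch does this for a toy database; the input format, the group names and the individuals are assumptions made for illustration.

```python
from collections import defaultdict

def atomic_groups(groups):
    """Partition individuals into indivisible groups.

    Two individuals fall in the same atomic group exactly when they belong
    to the same pre-defined groups. `groups` maps a group name to a set of
    individuals (hypothetical input format).
    """
    signature = defaultdict(set)
    for name, members in groups.items():
        for person in members:
            signature[person].add(name)
    atoms = defaultdict(set)
    for person, names in signature.items():
        atoms[frozenset(names)].add(person)
    return dict(atoms)

groups = {
    "programmer": {"ann", "bob", "cy"},
    "has_ms": {"bob", "cy", "dee"},
}
atoms = atomic_groups(groups)
```

Here the two overlapping groups yield three atomic groups: programmers without an M.S., programmers with an M.S., and non-programmers with an M.S. No combination of answers about the pre-defined groups can isolate a proper subset of any of them.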




3)  Controlled  changes  in  the  database.  The  DBA 
should  be  equipped  with  data  manipulation  operators,  and 
dynamics,  as  well  as  statics,  of  the  real  world  should  be 
revealed  to  users  (problem  3).  However  this  should  be  done 
in  a  controlled  manner  and  the  information  revealed  due  to 
the  changes  in  the  environment  should  be  recorded  for 
auditing. (Note that the 1974 US Privacy Act [Privacy Act, 1974]
necessitates the inclusion of changing aspects of the
environment.)

4) Information about users' additional knowledge.
Users' additional knowledge should be maintained and kept
up-to-date in the SDB. It is assumed that the DBA is
correctly  informed  about  users'  additional  knowledge  of 
protected  information. 

(b) Efficiency of the protection. Features of the SDB that
improve the efficiency of the protection are described
below.

1)  Disjoint  user  groups  should  be  defined  to 
utilize  the  fact  that  their  initial  knowledge  may  be 
substantially  different  from  each  other  or  they  may  not 
necessarily  have  the  same  access  authorization  to  different 
parts  of  the  database. 

2)  Different  levels  of  statistical  information 
should  be  revealed  to  different  users.  For  example,  some 
users  may  not  be  allowed  to  access  certain  detailed 
statistical  information. 

3) For each group of individuals about which
statistical information is to be revealed, allowable
statistical  query  types  are  defined.  This  leads  to  different 
security  constructs  and  mechanisms  for  different  types  of 
statistical  information. 

(c) Richness of the information revealed to users.
Clearly,  investigating  the  security  problem  at  the 
conceptual  model  level  provides  the  database  designers  with 
more  control  over  the  richness  and  usefulness  of  the  SDB. 
However,  atomic  information  should  not  be  further 
decomposable  by  templates  or  by  queries  such  as  join,  select 
and  project  operators  in  a  relational  model  [Codd,  1970; 
Codd,  1974].  It  is  also  assumed  that  the  DBA  confirms  the 
security  and  compatibility  of  any  new  view  before  granting 
access  to  it. 

4.2.1  Constituents  of  Statistical  Information 

Statistics  studies  specific  aspects  of  individuals  in  a 
population  which  may  be  conceptual  or  physical.  The 
individuals  in  the  population  have  something  in  common  so 
that  they  altogether  form  the  population.  Most  statistical 
methods  can  be  viewed  as  ways  of  making  inferences  about  a 
population.  Such  inferences  are  made  after  the  examination 
of  a  "sample"  from  the  population.  A  database  may  contain 
the  whole  population  or  a  sample  of  the  population.  The  user 
may  or  may  not  use  the  statistical  information  for 
statistical  inferences.  In  any  case  the  central  concept  is 
the population concept. For the specific environment at
hand, once the populations to be studied are identified then
the  individuals  are  no  longer  important,  and  two  individuals 
with  nothing  in  common  will  not  be  included  in  the  answer  of 
the  same  statistical  query. 

The  database  system  should  also  differentiate  the 
quantitative  properties  of  individuals  for  which  statistical 
information  is  to  be  revealed  and  the  defining 
characteristics  of  a  population.  For  example,  "sum  of 
salaries of employees" is a quantity related to the
employee  population  but  "sum  of  salaries  of  employees  where 
salary  >$12K"  gives  information  about  a  different 
population.  Not  distinguishing  this  difference  may  cause 
protection  problems.  Similarly,  for  example,  "number  of 
employees"  and  "number  of  employees  convicted  of  felony" 
give  information  about  two  different  populations. 

4.2.2  Formal  Definitions  of  the  SDB  and  the  Security  Problem 

In order to provide insight into the features of the SDB
and  a  framework  within  which  to  investigate  the  SDB  design, 
formal  definitions  of  the  SDB  and  the  security  problem  are 
given  in  this  section. 

In  general,  there  may  be  two  different  approaches  in 
modeling  a  database  system:  set  theoretic  models  or  finite 
state  models.  For  finite  state  modeling  there  may  be  two 
approaches:  (a)  state  snapshot,  in  which  the  rules  are  given 
to define valid states of the database, (b) state
transition, in which legal database operations are given,
and  these  operations  are  guaranteed  to  preserve  the  security 
of  the  SDB.  In  this  section,  a  finite  state  model  of  an  SDB 
with  state  transition  approach  is  described. 

The  database  system  models  some  portion  of  the  real 
world which is called the application. The application can
be  thought  of  as  having  a  state  and  certain  allowed 
transitions  between  states.  The  application  state  represents 
a  "snapshot"  of  the  application  at  a  given  time.  An 
application  is  represented  in  the  database  system  by  an  SDB 
data model.


SDB  Data  Model: 


The SDB Data Model is a 4-tuple

(Schema,  Query  Types  Set,  Query  Mapping  Function, 
Operation  Types  Set) 

Schema  contains  descriptions  of  populations  and  their 
properties.  For  example,  schema  for  the  D-A  Model  [Smith  and 
Smith,  1977a]  contains  hierarchies  of  object  types.  The 
Schema  for  the  E-R  Model  [Chen, 1976]  contains  the 
descriptions  of  entity  sets  and  relationship  sets.  The  Query 
Types  Set  contains  allowable  statistical  query  types  such  as 
MAX, MIN, SUM, MEDIAN, etc. The Query Mapping Function
identifies which statistical query types are allowed for
properties of populations in the schema, i.e.,

Query Mapping Function : Properties of Populations --> Subsets of the Query Types Set
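
A minimal sketch of such a mapping, assuming a simple table keyed by (population, property), is given below. The populations, properties and permitted query types shown are hypothetical entries chosen only to illustrate the idea.

```python
QUERY_TYPES = {"MAX", "MIN", "SUM", "MEDIAN"}

# (population, property) -> allowed subset of QUERY_TYPES.
# All entries are hypothetical examples.
query_mapping = {
    ("EMPLOYEE", "salary"): {"SUM", "MEDIAN"},
    ("EMPLOYEE", "age"): {"MAX", "MIN"},
}

def is_allowed(population, prop, query_type):
    """Return True if the query type is permitted for this property."""
    return query_type in query_mapping.get((population, prop), set())
```

A property that has no entry maps to the empty set, so any query on it is refused by default.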


Operation  Types  Set  contains  two  different  groups  of 
operation  types: 

Operation Type-1$_i$ : Schema x Arguments --> Schema,  $1 \le i \le m_1$

Operation Type-2$_i$ : Schema x Arguments x Database State --> Database State,  $1 \le i \le m_2$


Type-1 operations correspond to high-level schema
modification operations such as "decompose population" or
"delete population", etc. Type-2 operations correspond to
low-level operations such as insertion, deletion or update
of individuals. Notice that (unlike in [Borkin, 1978]) the
schema is not assumed to be static. The Database State of
the SDB consists of each user group's knowledge set (to be
described) and a representation of the application state.
The representation of the application state may be a mapping
from the schema to the subsets of individual objects and
relationship tuples specifying the individuals belonging to
the populations in the schema.

Given a schema and a set of arguments for operation
type-1, one can generate operation (i) corresponding to
argument (i) as

Operation (i) : Schema --> Schema

Or, given a schema, a set of arguments and the database
state, one can generate operation (i) corresponding to
argument (i) as

Operation (i) : Database State --> Database State
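
The step from operation types to concrete operations resembles partial application: fixing the arguments (and, for Type-2, the schema) leaves a function on schemas or on database states. The sketch below illustrates this under assumed representations; the dictionary layouts and the two sample operation types are hypothetical.

```python
from functools import partial

# Hypothetical schema: population name -> set of property names.
# Hypothetical database state: population name -> list of individuals.

def delete_population(schema, args):
    """Type-1 operation type: Schema x Arguments --> Schema."""
    return {p: props for p, props in schema.items() if p != args["population"]}

def insert_individual(schema, args, state):
    """Type-2 operation type: Schema x Arguments x Database State --> Database State."""
    population = args["population"]
    if population not in schema:
        raise KeyError(f"unknown population: {population}")
    new_state = {p: list(rows) for p, rows in state.items()}
    new_state.setdefault(population, []).append(args["individual"])
    return new_state

schema = {"PROGRAMMER": {"salary"}, "SECRETARY": {"salary"}}
state = {"PROGRAMMER": [], "SECRETARY": []}

# Fixing the arguments yields an operation Schema --> Schema.
drop_secretary = partial(delete_population, args={"population": "SECRETARY"})

# Fixing the schema and the arguments yields an operation
# Database State --> Database State.
add_programmer = partial(
    insert_individual,
    schema,
    {"population": "PROGRAMMER", "individual": {"name": "ann", "salary": 12}},
)

new_schema = drop_secretary(schema)
new_state = add_programmer(state)
```

Both operations return new values rather than modifying their inputs, which keeps each database state available for later comparison or auditing.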

SDB  Knowledge  Base: 

SDB  Knowledge  Base  is  a  system 

(User  Groups  Set,  Knowledge  Sets, 

Knowledge  Base  Operation  Types  Set) 

User  Groups  Set  consists  of  user  groups,  and  each  user 
belongs  to  one  user  group.  For  each  user  group,  there  is  one 
Knowledge  Set  containing  the  users'  additional  knowledge 
about  the  application,  which  is  in  the  form  of  explicit 
facts and general rules.+ Explicit facts can be represented
by  a  set  of  predicates  describing  either  the  relationships 
between  entities  in  the  application  or  the  property  values 
of  entities.  Knowledge  Base  Operation  Types  Set  contains 
available  knowledge  base  operation  types,  and  given  a  set  of 
arguments  and  a  knowledge  set,  one  can  generate  the 
operation  corresponding  to  any  operation  type  as 

Operation : Knowledge Set --> Knowledge Set


+An example of a general rule is "Every programmer has a
B.S. degree". [Minker, 1978] refers to general rules as
"axioms".

Since the only database states of concern are those
which  can  be  reached  by  the  set  of  allowed  operations,  the 
set  of  valid  database  states  can  be  defined  as  consisting  of 
some  initial  state  and  those  states  consisting  of  the 
closure  of  the  SDB  data  model's  set  of  allowable  operations 
and  knowledge  base  operations  of  user  groups. 
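
The set of valid states can be pictured as everything reachable from the initial state by repeatedly applying allowed operations. The sketch below computes that closure for a toy state space; the hashable-state requirement, the `limit` guard and the sample operations are assumptions of this example.

```python
from collections import deque

def reachable_states(initial, operations, limit=10_000):
    """Return every state reachable from `initial` via the given operations.

    States must be hashable; each operation maps a state to a state, or to
    None when it is not applicable. `limit` guards against very large state
    spaces (an assumption of this sketch).
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for op in operations:
            nxt = op(state)
            if nxt is not None and nxt not in seen:
                if len(seen) >= limit:
                    raise RuntimeError("state space exceeds limit")
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy state: the frozenset of individuals currently in one population.
def insert(x):
    return lambda s: s | {x}

def delete(x):
    return lambda s: (s - {x}) if x in s else None

ops = [insert("ann"), insert("bob"), delete("ann")]
valid = reachable_states(frozenset(), ops)
```

Because "bob" can be inserted but never deleted in this toy example, the closure contains four states, and a state such as {"cy"} is not valid.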

Statistical  Database: 

The  SDB  specifies  the  SDB  data  model,  the  current 
database  state,  all  possible  state  transitions  and  a  set  of 
security  constraints  about  the  representation  of  the 
application  state.  Security  constraints  are  dynamic 
conditions  that  are  always  satisfied  at  any  database  state. 
Security  constraints  are  dependent  on  users'  additional 
knowledge,  representation  of  the  application  state,  the 
query  mapping  function,  etc..  Security  constraints,  when 
applied  to  the  user  group's  statistical  queries,  are  in  the 
form  of  either  "suppress  user  group  u's  statistical  queries 
of  type  i  for  population  p  if  condition  C  holds"  or  "remove 
individuals x, y, ..., z from statistical queries of type i for
population p if condition C holds". Thus the SDB is a system

(Data  Model,  Database  State,  Security  Constraints) 
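
The two forms of security constraint can be sketched as records that either suppress a query or exclude named individuals from it. The class, its field names and the SUM-only query function below are assumptions made for illustration, not the constructs developed later in the design.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

@dataclass(frozen=True)
class Constraint:
    """One security constraint; field names are illustrative only."""
    user_group: str
    query_type: str
    population: str
    condition: Callable[[list], bool]        # condition C on the population
    remove: Optional[FrozenSet[str]] = None  # None means "suppress the query"

def answer_sum(user_group, population, rows, prop, constraints):
    """Answer a SUM query, or return None if a constraint suppresses it."""
    excluded = set()
    for c in constraints:
        applies = (c.user_group, c.query_type, c.population) == (
            user_group, "SUM", population)
        if applies and c.condition(rows):
            if c.remove is None:
                return None  # "suppress" form
            excluded |= c.remove  # "remove individuals" form
    return sum(r[prop] for r in rows if r["name"] not in excluded)

rows = [
    {"name": "ann", "salary": 12},
    {"name": "bob", "salary": 14},
    {"name": "cy", "salary": 11},
]
constraints = [
    # Suppress SUM for group "clerks" when the population is too small.
    Constraint("clerks", "SUM", "PROGRAMMER", lambda rs: len(rs) < 2),
    # Always leave "bob" out of SUM answers given to group "auditors".
    Constraint("auditors", "SUM", "PROGRAMMER", lambda rs: True,
               remove=frozenset({"bob"})),
]
```

Keeping the condition as a callable lets the same structure express constraints that depend on the current application state, as the definition above requires.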

Security  Problem  of  the  SDB: 

For each user group, the set of explicit facts relevant
to the application represented by the SDB is classified as
confidential ($F_c$) or nonconfidential ($F_n$) and known ($F_k$) or
unknown ($F_u$) (see Figure 4.1). (Note that $F_k$ is represented
by the Knowledge Set and the representation of the
application state.) Compromise (or disclosure) occurs if a
user in a user group changes $F_{u,c}$ (i.e. $F_u \cap F_c$) into
$F'_{u,c}$ where $F'_{u,c} \subset F_{u,c}$.+ Similarly, the set of
general rules is classified as known ($I_k$) and unknown ($I_u$),
and the unknown set of general rules is classified as
strongly-compromising ($I_{st}$), weakly-compromising ($I_w$) and
safe ($I_{sf}$) as follows:

(1) Let $p_i$, $1 \le i \le k$ and $k \in$ Integers, be a disclosure
procedure which, in order to compromise the SDB, uses only
some known facts and general rules and a nonempty subset $I_i$
of the unknown general rules set, $I_u$. Then the
strongly-compromising set of general rules is defined as

$$I_{st} = \bigcup_{i=1}^{k} I_i$$

(2) Let $p_j$, $1 \le j \le t$ and $t \in$ Integers, be a disclosure
procedure which uses only some known facts and general
rules, a nonempty subset, $F^{j}_{u,n}$, of unknown
nonconfidential facts and a nonempty subset $I_j$ of unknown
general rules. Then the weakly-compromising set of general
rules is defined as

$$I_w = \bigcup_{j=1}^{t} I_j - I_{st}$$

Moreover we define

$$F'_{u,n} = \bigcup_{j=1}^{t} F^{j}_{u,n}$$

Clearly, $F'_{u,n} \subseteq F_{u,n}$.

+ $\subset$ means "proper subset".

The set of safe general rules is defined as

$$I_{sf} = I_u - (I_w \cup I_{st})$$

Clearly, at any valid database state, $F_k$ is closed over
$I_k$ in the sense that any disclosure procedure using only
known facts and known general rules can only infer known
explicit facts.

For the security of the SDB, the set of unknown
confidential facts, $F_{u,c}$, and the strongly-compromising
set of general rules, $I_{st}$, should be protected. Moreover the
database system should protect either $F'_{u,n}$ or $I_w$ or a
subset $F''_{u,n}$ of $F'_{u,n}$ and a subset $I'_w$ of $I_w$ such
that there does not exist any disclosure procedure which
uses only ($F'_{u,n} - F''_{u,n}$), $F_k$ and ($I_w - I'_w$).

The security problem of SDB can now be defined as
follows: the SDB is secure at a database state iff $F_k$ is
closed and the database system ensures protections of
$F_{u,c}$, $I_{st}$, and either $F'_{u,n}$ or $I_w$ or ($F''_{u,n}$
and $I'_w$).
[Diagram with two parts: "Set of explicit facts" and "Set of general rules".]

Figure 4.1. Classification of explicit facts and general
rules relevant to the application
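
The classification of unknown general rules into strongly-compromising, weakly-compromising and safe sets reduces to a few set operations once the disclosure procedures are known. The sketch below applies those operations; it assumes that the procedures have already been enumerated elsewhere and that each is summarized by the unknown nonconfidential facts and unknown rules it needs. The rule and fact names are hypothetical.

```python
def classify_rules(unknown_rules, procedures):
    """Split the unknown general rules into (I_st, I_w, I_sf).

    `procedures` is an iterable of pairs (facts, rules): the unknown
    nonconfidential facts and the unknown general rules that one disclosure
    procedure needs. Enumerating the procedures is assumed to be done
    elsewhere; this sketch only applies the set definitions.
    """
    strongly, with_facts = set(), set()
    for facts, rules in procedures:
        if not rules:
            continue  # uses no unknown rule, so it defines neither set
        if facts:
            with_facts |= set(rules)   # candidate weakly-compromising rules
        else:
            strongly |= set(rules)     # needs unknown rules only
    weakly = with_facts - strongly
    safe = set(unknown_rules) - (weakly | strongly)
    return strongly, weakly, safe

I_u = {"r1", "r2", "r3", "r4"}
procedures = [
    (set(), {"r1"}),          # needs only unknown rule r1
    ({"f1"}, {"r1", "r2"}),   # needs fact f1 plus rules r1 and r2
    ({"f2"}, {"r3"}),         # needs fact f2 plus rule r3
]
I_st, I_w, I_sf = classify_rules(I_u, procedures)
```

In this example r1 is strongly-compromising, r2 and r3 are weakly-compromising, and r4 is safe because no disclosure procedure uses it.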
4.2.3 Conceptual Data Models for SDB Design

The  SDB  data  model  should  be  similar  to  the  conceptual 
data  model  of  any  other  general-purpose  database.  There  are 
several  reasons  for  this  requirement,  besides  effectiveness 
of  the  protection  and  the  richness  of  the  SDB. 

1)  For  some  users  and  at  least  for  the  DBA,  the  SDB  is 
just  a  normal  database  and  these  users  should  have  access  to 
all  information  in  the  database  (not  just  aggregates). 

2)  It  is  necessary  to  have  total  information  about  the 
environment  in  order  to  enforce  security,  integrity  and 
validity  in  the  database. 

Although  Section  4.2.2  gives  the  definition  of  the  SDB 
data  model  it  does  not  advocate  any  specific  data  model. 

With  the  recently  renewed  interest  in  conceptual  data 
models,  over  thirty  different  data  models  are  mentioned  in 
[Kerschberg et al., 1976; Nijssen, 1976]. From the security
viewpoint,  our  concern  is  twofold.  We  are  concerned  about 
the  structure  of  the  conceptual  data  model  in  order  to 
define  atomic  information  and  to  give  out  controlled 
statistical  information.  We  are  also  concerned  about  the 
semantics  and  the  redundancy  of  the  conceptual  model  in 
order  to  successfully  mirror  the  real  world  environment  so 
that  (a)  the  security  measures  can  easily  and  naturally  be 
provided  and  (b)  the  database  is  still  a  highly  rich  and 
useful  one  for  users.  Thus  a  structured,  semantic  and 
redundant conceptual model for the SDB is required. In this
chapter, the Data Abstraction Model (D-A Model) [Smith and
Smith,  1977a;  Smith  and  Smith,  1977b]  will  be  used  in  the 
design  of  the  SDB.  The  choice  of  the  D-A  Model  among  other 
structured,  semantic  and  redundant  data  models  is  motivated 
by  the  ease  in  applying  protection  measures  without  bringing 
many  extra  constructs  and  restrictions  to  the  conceptual 
model.  However,  the  SDB  design  may  easily  be  modified  for 
any  other  structured,  redundant  and  semantic  data  model;  and 
below,  two  other  data  models,  namely,  the  Entity- 
Relationship  Model  [Chen,  1976]  and  the  Extended  Relational 
Model  [Codd,  1979],  are  discussed  for  their  suitability  as  a 
conceptual  model  of  the  SDB.  It  should  be  noted  that  the  aim 
is  to  investigate  the  needed  modifications  (i.e.  rules  and 
constructs)  in  order  to  define  populations  clearly  and  to 
analyze  and  control  inferences.  Thus  the  data  models  are 
only  briefly  described,  and  other  issues  such  as  expressive 
power,  semantics,  naturalness,  etc.,  of  the  data  models  are 
not  discussed. 

The  Data  Abstraction  Model: 

In  this  section,  the  D-A  Model  for  SDB  design  is 
summarized  and  modified.  Our  goals  are  to  augment  the 
conceptual  data  model  with  the  population  concept  and  to 
identify  atomic  information. 

Smith  and  Smith  introduce  two  kinds  of  database 
abstractions. Aggregation (naming relationships) is an
abstraction which turns a relationship between objects into
an  aggregate  object.  Generalization  (naming  classes)  is  an 
abstraction  which  turns  a  class  of  objects  into  a  generic 
object.  All  objects  (individual,  aggregate,  generic)  are 
given  uniform  treatment  in  the  D-A  Model.  The  real  world  is 
modeled  as  a  set  of  aggregation  hierarchies  intersecting 
with  a  set  of  generalization  hierarchies.  Abstract  objects 
(i.e.  generic  and  aggregate  objects)  occur  only  at  the 
points  of  intersection.  In  the  context  of  the  relational 
model  [Codd,  1970;  Codd,  1974],  the  D-A  Model  is  proposed  as 
a  conceptual  model.  Our  aim  is  to  modify  the  generalization 
hierarchy  in  such  a  way  that  all  populations  are  identified 
in  a  systematic  manner  and  a  generic  object  in  the  hierarchy 
consists of a (group of) population(s).

In  the  D-A  Model,  for  a  class  of  individual  objects 
corresponding  to  a  generic  or  aggregate  object  G,  the  set  of 
attributes  (or  properties)  which  are  common  to  all 
individual objects are called G-attributes (or
G-properties). Clearly, individual objects of all generic
objects  that  are  descendants  of  G  in  the  generalization 
hierarchy  also  have  the  same  attributes. 

Consider  the  same  example  of  employees  of  a  certain 
computer  manufacturing  company.  Figure  4.2  illustrates  one 
particular  decomposition  of  Computer  Scientist  into  lower 
level  generic  objects.  Notice  that  there  are  two  mutually 
exclusive groups of partitions (also called clusters) of
Computer Scientist, one is {Programmer, Systems Programmer,
Systems  Analyst}  and  the  other  is  according  to  the  degree 
obtained.  Now  assume  that  one  also  has  the  "country  in  which 
Ph.D.  was  obtained"  information  about  Computer  Scientists. 
Clearly  one  may  ask  about  the  "population  of  US-educated 
systems  analysts  with  Ph.D.".  In  [Smith  and  Smith,  1977b] 
this  information  is  kept  as  an  attribute  of  objects  in  the 
generalization  hierarchy  and  there  is  no  provision  for 
further  partitioning.  The  reason  for  this  is  that  each 
abstract  object  is  required  to  be  explicitly  named  using 
natural-language  nouns  (e.g.  Programmer,  Systems  Analyst, 
etc.)  and  these  names  help  us  to  relate  our  understanding  of 
the  real  world  with  its  intended  reflection  in  the  relation 
definition.  However,  the  generic  object  "US-educated  systems 
analyst  with  Ph.D."  is  certainly  described  by  a  phrase,  not 
by  a  natural  language  noun  and  yet  we  are  interested  in  this 
particular  object  and  it  has  to  exist  in  the  hierarchy.  Thus 
for  SDB  design  purposes  we  will  take  more  freedom  at  this 
point  and  use  phrases  to  describe  populations.  Figure  4.3 
contains  the  partitioning  of  Systems  Programmer  with  Ph.D. 
and  Systems  Analyst  with  Ph.D.  according  to  the  attribute 
"country  in  which  Ph.D.  is  obtained". 


Now assume that we also have the "years of programming
experience" information for Programmers. Populations using
this information may be formed, such as "programmers with 5
years of programming experience" or "programmers with M.S. and
2 months of programming experience", etc. At this stage a
design  decision  problem  appears.  If  the  "years  of  experience 
in  the  company"  information  uniquely  identifies  many 
individuals  by  creating  large  numbers  of  populations  with 
single  individuals,  the  security  is  endangered.  We  assume 
that  SDB  designers  make  the  decomposition  decisions  using 
their  knowledge  of  users'  needs,  i.e.  if  there  are  very  many 
populations  each  with  few  individuals,  then  the  designers 
will  cut  down  the  number  of  populations  and  still  preserve  a 
good model of the real world. However, this does not mean
that initial design decisions cannot be changed; indeed, if
a  need  arises,  some  mechanisms  will  be  available  to  the  DBA 
so  that,  with  an  assessment  of  the  security  of  protected 
information,  the  decomposition  of  objects  may  be  changed 
some  time  later.  Figure  4.4  shows  one  particular  design 
decision  about  the  usage  of  "years  of  programming 
experience"  information  for  decomposing  the  object 
Programmer.
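
The design decision can be illustrated by comparing a fine decomposition with a coarse one. In the sketch below, the minimum population size, the bin edges and the sample data are all assumptions; the thesis leaves the actual choice to the SDB designers.

```python
def bin_by_years(years_by_person, bin_edges):
    """Group individuals into half-open bins [lo, hi) of years of experience."""
    bins = {}
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        bins[(lo, hi)] = {p for p, y in years_by_person.items() if lo <= y < hi}
    return bins

def too_small(populations, min_size):
    """Return the non-empty populations with fewer than min_size individuals."""
    return {k: v for k, v in populations.items() if 0 < len(v) < min_size}

years = {"ann": 1, "bob": 2, "cy": 3, "dee": 7, "eve": 8, "fay": 9}

# One population per distinct value isolates every individual.
per_value = bin_by_years(years, list(range(0, 11)))

# Coarser bins (hypothetical edges) avoid single-individual populations.
coarse = bin_by_years(years, [0, 5, 10])
```

With one population per year of experience, six populations contain a single individual each; with the two coarser ranges, no population falls below the assumed minimum size of two.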

In  the  SDB,  statistical  information  about  individuals 
in  a  population  is  made  available  to  users.  Clearly,  each 
abstract  object  in  the  D-A  model  forms  a  population  of 
individual objects. We call the smallest nondecomposable group
of individuals an Atomic Population (A-population). For
example, in Figure 4.5, ASSIGNMENT-IN-DATABASE-PROJECT,
PROGRAMMER and PROJECT are A-populations. In order to
preserve the indivisibility property of A-populations the
following rule is applied: any population corresponding to
any abstract object in the model is composed of mutually
exclusive A-populations that explicitly exist in the model
(Rule 1). The restriction that A-populations explicitly
exist  in  the  conceptual  model  may  bring  limitations  to  the 
richness  of  the  SDB.  However,  it  is  needed  to  provide 
systematic  assurance  of  the  security  of  protected 
information  in  the  SDB. 
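
Rule 1 can be checked mechanically when populations are available as sets of individuals. The sketch below is one possible check under that assumption; the representation and the sample A-populations are hypothetical.

```python
def satisfies_rule_1(population, a_populations):
    """Check that `population` is a union of mutually exclusive A-populations.

    `a_populations` maps A-population names to sets of individuals
    (hypothetical representation). Returns the names of the A-populations
    that compose `population`, or None if Rule 1 is violated.
    """
    names = list(a_populations)
    # A-populations must be pairwise disjoint.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a_populations[a] & a_populations[b]:
                return None
    parts = []
    covered = set()
    for name, members in a_populations.items():
        if members and members <= population:
            parts.append(name)
            covered |= members
        elif members & population:
            return None  # an A-population would be split
    return parts if covered == population else None

a_pops = {
    "PROG-WITH-BS": {"ann", "bob"},
    "PROG-WITH-MS": {"cy"},
    "SECRETARY": {"dee"},
}
programmer = {"ann", "bob", "cy"}
```

The population of all programmers passes the check, while a population such as {"ann", "cy"} fails because it would divide the A-population PROG-WITH-BS.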
[Diagram: COMPUTER-SCIENTIST decomposed into lower-level generic objects; legible node labels include SYSTEMS-PROG-WITH-BS, SYSTEMS-PROG-WITH-MS, SYSTEMS-PROG-WITH-PHD, PROG-WITH-MS, SYST-A-BS, SYST-A-MS and SYST-A-PHD.]

Figure 4.2. Decomposition of the generic object Computer
Scientist
[Diagram: legible node labels end in EDUCATED-SYSTEMS-PROGRAMMER-WITH-PHD and EDUCATED-SYSTEMS-ANALYST-WITH-PHD.]

Figure 4.3. Decomposition of the generic object Computer
Scientist with PhD




Figure 4.4. Generalization Hierarchy for Computer Scientist.
[Diagram: legible node labels include SECRETARY, SYSTEMS-PROGRAMMER and SYSTEMS-ANALYST-WITH-0-4-YEARS-EXPERIENCE.]

Figure 4.5. Database of employees, projects and assignments
The Entity-Relationship (E-R) Model:

The  E-R  Model  adopts  the  view  that  the  real  world 
consists of entities and relationships. An entity is a thing
which  can  be  distinctly  identified.  A  relationship  is  an 
association  among  entities.  Entities  are  classified  into 
different  entity  sets  such  as  EMPLOYEE,  PROJECT  and 
DEPARTMENT.  Similarly  relationships  are  classified  into 
relationship  sets  such  as  PROJECT-WORKER  and  DEPARTMENT- 
EMPLOYEE. 

In  the  SDB,  entities  and  relationships  correspond  to 
individuals,  and  each  entity  set  or  relationship  set  is  a 
population. Statistical information about values in
attribute-value pairs of entities or relationships is
revealed to users. However, in order to define A-populations
and  to  enforce  security  constraints,  we  need  additional 
rules  and  constructs  as  described  below. 

1)  Some  means  are  needed  to  identify  the  A-populations 
that  a  population  contains.  For  example,  the  fact  that 
entity  set  MALE-PERSON  is  a  subset  of  the  entity  set  PERSON 
should  be  easily  accessible  to  the  database  system.  This  is 
needed,  for  example,  when  constraints  applied  to  individuals 
in  an  A-population  are  also  applied  to  all  populations  that 
include  the  same  A-population  (the  details  of  this 
requirement  will  be  discussed  in  the  later  sections). 


2) Each entity set or relationship set must be composed
of some mutually exclusive entity sets or relationship sets
(rule 1).

3)  The  protection  mechanism  of  the  SDB  should  be  able 
to  locate  all  populations  that  contain  a  given  A-population. 
This  is  needed,  for  example,  while  processing  insertions, 
deletions  or  updates  of  individuals.  In  the  E-R  Model, 
locating  all  populations  containing  a  given  A-population 
requires  additional  structures  or  rules. 

Finally,  in  the  E-R  Model,  the  job  of  defining  the 
allowable  types  of  statistical  queries  in  a  systematic 
manner  relies  on  the  DBA  whereas  this  task  is  easier  in  the 
D-A model because of its hierarchical structure (see Section
4.2.4).

The  Extended  Relational  Model  (RM/T): 

In RM/T, there are entities and entity types classified
by whether they

1) fill a subordinate role in describing entities of
some other type, in which case they are called
characteristic,

2) fill a superordinate role in inter-relating
entities of other types, in which case they are called
associative,

3) do neither of the above, in which case they are called
kernel.


Using these entity types, the semantic structures defined
are characteristic tree, association graph, cartesian
aggregation, generalization and cover aggregation. Cartesian
aggregation is the aggregation abstraction of the D-A Model.
Below, each of the new semantic structures of RM/T is
discussed, together with the related SDB issue of how to
obtain populations and indivisible A-populations.

a) Characteristic tree. The characteristic entity types
that provide description of a given kernel entity type form
a characteristic tree. Example 4.2 below is from [Codd,
1979].


Figure 4.6. A Characteristic Tree

Example 4.2 EMPLOYEES (a kernel entity type) have a
JOB-HISTORY (characteristic entity type subordinate to EMPLOYEE)
whose immediate properties are DATE-ATTAINED-POSITION and
NAME-OF-POSITION (see Figure 4.6). This information is
augmented by SALARY-HISTORY (characteristic entity type
subordinate to JOB-HISTORY) whose immediate properties are
DATE-OF-SALARY-CHANGE and NEW-SALARY. The mapping between
entities in the parent and child nodes is one-to-many, e.g.
one EMPLOYEE has many JOB-HISTORY entities and one
JOB-HISTORY entity has many SALARY-HISTORY entities.

Each of the EMPLOYEE, JOB-HISTORY and SALARY-HISTORY nodes
forms a population. A-populations of these populations may
be formed by decomposing them (as generalization
abstractions) using their properties, their mapping to the
parent nodes, etc. For example, SALARY-HISTORY may be
decomposed using (a) EMPLOYEE individuals, (b)
NAME-OF-POSITION of JOB-HISTORY, (c) date, (d) salary ranges.
Notice that, for the SDB, if the decomposition of a
population in the characteristic tree is effected by its
parent populations, some limitations may have to be imposed
on the information revealed to users in order to prevent
compromise.

b) Association Graph. An associative entity
inter-relates entities of other types, and this inter-relation
is represented by the association graph in the RM/T. If the
association among individual entities is to be protected
then some limitations should be imposed on the information
revealed to users about the associative entities or the
entities that they inter-relate.


c) Generalization. Codd [Codd, 1979] renames the
generalization abstraction of the D-A Model as Unconditional
Generalization Inclusion (UGI), and also describes an
abstraction called Alternative Generalization Inclusion
(AGI), which is an alternative or conditional inclusion of
entities of an entity type into some other entity types.
Clearly, the only structural difference between the UGI and
the AGI is that the AGI decomposes a population P into
mutually exclusive populations that are one level above P in
the hierarchy. Thus controlling inferences for the AGI is
the same as for the UGI.


d)  Cover  Aggregation.  Cover  aggregation  is  an 
aggregation  in  which  a  subset  of  entities  of  the  same  type 
forms  another  entity  with  a  different  entity  type.  For 
example, a CONVOY-OF-SHIPS is a cover entity of entity type
SHIP,  or  a  CLUB  that  some  people  belong  to  forms  a  cover 
aggregate  of  PEOPLE. 


For  the  SDB  design,  cover  aggregate  entity  types 
partition  the  group  of  entities  resulting  in  smaller 
populations.  Intersection  of  these  smaller  populations  gives 
A-populations of covered individuals.

4.2.4  Statistical  Information  Related  to  each  Population 


For each population in the conceptual model of the SDB,
the following is defined. (Note that (a) and (b) below are
defined by the schema and the Query Mapping Function,
respectively, in the formal definition of the SDB in Section
4.2.2).

(a) the properties of the population+ for which
statistical information is to be revealed, e.g. SALARY or
ABSENT-DAYS for the population EMPLOYEE,

(b)  whether  COUNT  queries  requesting  the  number  of 
individuals  in  the  population  are  permitted,  and 

(c) the allowable types of statistical information for
each property of the population which may be one or more of

MEAN,
SUM,
MAX, MIN, MEDIAN, K-LARGEST (order statistics),
VARIANCE,
STANDARD DEVIATION,
k-MOMENT, k = 2, 3, ...

Clearly, if, in Figure 4.4, the SUM query for the SALARY
property of PROGRAMMER and SYSTEMS-PROGRAMMER is allowed
then SUM information for the SALARY of SYSTEMS-ANALYST is
deducible. Thus, unless individual security needs of
populations require otherwise, the following two rules are
found necessary for the uniformity of the revealed
statistical information and richness of the database.
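The deduction referred to above is a simple subtraction. The following
is a minimal sketch in Python, under the assumption that the SUM of a
parent population and the SUMs of all but one of its mutually exclusive
child populations are available to the user. The function name and the
salary figures are hypothetical and are used only for illustration.

```python
from typing import Iterable


def deduce_missing_sum(parent_sum: float, known_child_sums: Iterable[float]) -> float:
    """SUM of the one child population that was not queried directly.

    Assumes the children form a mutually exclusive decomposition of the
    parent, so the child SUMs add up to the parent SUM.
    """
    return parent_sum - sum(known_child_sums)


# Hypothetical figures (in $K): the parent SUM and two sibling SUMs
# are enough to recover the SUM of the third sibling.
hidden_sum = deduce_missing_sum(300, [120, 100])
```

Withholding the SUM of one population in a cluster therefore protects
nothing when the other SUMs are revealed, which motivates the
uniformity required by the rules that follow.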

Rule 2) The allowable set of statistical query types
should be identical for the same property of all populations
in the same cluster. (Subpopulations created by a mutually
exclusive decomposition of a population in the
generalization hierarchy form a cluster).

+By "the property of a population", "the property of
individuals in the population" is meant.

Consider Figure 4.4. Assume that SALARY is an attribute
of all the objects in the hierarchy and the SUM query is
allowed for SALARY of Programmer. Then the SUM query should
also be allowed for SALARY of Systems Programmer and Systems
Analyst.

Rule 3) The allowable set of statistical query types
for a property of any population should be a subset of the
allowable set of query types for the same property (if it
exists) of its father population in the generalization
hierarchy.

Consider Figure 4.4. Assume the statistical query SUM
of SALARY is allowed in populations Programmer, Systems
Programmer and Systems Analyst, and the statistical query
MEDIAN of SALARY is allowed in populations Computer Scientist
with Bs, Ms and PhD. Then the statistical queries SUM and
MEDIAN must both be allowed for the population Computer
Scientist.
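Rules 2 and 3 can be checked mechanically once the allowable query
types are recorded per population and property. The sketch below is
one possible check in Python, not a part of the proposed design. The
dictionary layout and the function names are assumptions made for
illustration.

```python
from typing import Dict, List, Set, Tuple

# population name -> property name -> allowable statistical query types
Allowed = Dict[str, Dict[str, Set[str]]]


def rule2_violations(cluster: List[str], allowed: Allowed) -> List[str]:
    """Properties whose allowable query types differ inside one cluster."""
    props: Set[str] = set()
    for pop in cluster:
        props |= set(allowed.get(pop, {}))
    bad = []
    for prop in sorted(props):
        # Only populations that actually have the property are compared.
        variants = {frozenset(allowed[pop][prop])
                    for pop in cluster if prop in allowed.get(pop, {})}
        if len(variants) > 1:
            bad.append(prop)
    return bad


def rule3_violations(father: Dict[str, str], allowed: Allowed) -> List[Tuple[str, str]]:
    """(child, property) pairs whose query types are not a subset of the father's."""
    bad = []
    for child, parent in father.items():
        for prop, types in allowed.get(child, {}).items():
            parent_types = allowed.get(parent, {}).get(prop)
            # Rule 3 applies only if the property exists in the father.
            if parent_types is not None and not types <= parent_types:
                bad.append((child, prop))
    return sorted(bad)


allowed: Allowed = {
    "COMPUTER-SCIENTIST": {"SALARY": {"SUM", "MEDIAN"}},
    "PROGRAMMER": {"SALARY": {"SUM"}},
    "SYSTEMS-PROGRAMMER": {"SALARY": {"SUM"}},
    "SYSTEMS-ANALYST": {"SALARY": {"SUM"}},
}
cluster = ["PROGRAMMER", "SYSTEMS-PROGRAMMER", "SYSTEMS-ANALYST"]
father = {name: "COMPUTER-SCIENTIST" for name in cluster}
```

With the data above both functions return empty lists. Adding MAX to
the SALARY entry of SYSTEMS-ANALYST alone would be reported by both
checks.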


Since COUNT queries do not directly reveal information
about protected properties of populations, applying
protection measures down to A-populations may unnecessarily
restrict the richness of the SDB. Thus, a security atom
population (SA-population) is defined to be the largest
population such that no statistical information about any
property of any of its proper subsets can be revealed to
users. Notice that an SA-population contains one or more
A-populations. The set of values to be protected for each
property in an SA-population is called a security atom value
set (SA-value set). The following example illustrates
SA-populations.

Example 4.3 Consider Figure 4.4. Assume there are only two
protected properties, SALARY and ABSENT-DAYS, and

(i) for all populations, COUNT query is allowed.

(ii) SUM query for SALARY and ABSENT-DAYS is allowed
for populations a1, a2, a3 and a4. MEDIAN query for SALARY
is allowed for populations a1, a5, a6 and a7. Also, SUM
query for ABSENT-DAYS is allowed for populations a10, a11
and a12. Clearly, populations a8, a9, a13, a14, a15, a16,
a23 and a24 contain nondecomposable SALARY information
revealed to users. Similarly, populations a10, a11, a12, a3
and a4 contain nondecomposable ABSENT-DAYS information
revealed to users. The intersections of these populations
will give SA-populations a17, a18, a19, a20, a21, a22, a13,
a14, a15, a16, a23 and a24. Notice that an SA-population may
contain one or more A-populations (e.g. a23 and a24).
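The intersection step of Example 4.3 can be sketched as follows. Since
Figure 4.4 is not reproduced here, the individuals and the two
decompositions in the sketch are hypothetical. Each decomposition
stands for the populations that carry nondecomposable information for
one protected property, and the non-empty intersections play the role
of the SA-populations.

```python
from itertools import product
from typing import FrozenSet, Iterable, List, Set


def intersect_decompositions(*decompositions: Iterable[Set[int]]) -> List[FrozenSet[int]]:
    """Non-empty intersections of populations, one taken from each decomposition."""
    result = []
    for combination in product(*decompositions):
        common = frozenset.intersection(*map(frozenset, combination))
        if common:
            result.append(common)
    return result


# Hypothetical individuals 1..6.
salary_populations = [{1, 2, 3}, {4, 5, 6}]       # nondecomposable SALARY information
absent_populations = [{1, 2}, {3, 4}, {5, 6}]     # nondecomposable ABSENT-DAYS information

sa_populations = intersect_decompositions(salary_populations, absent_populations)
```

In this sketch the result contains four sets. No statistic is revealed
for any proper subset of them under either property.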

4.2.5  Security  Constraints 

Dynamics of the real world or the existence of complex
relationships between populations may lead the DBA to impose
constraints on the security-related information in
populations. The DBA should be able to state the conditions
under which any statistical query about any protected
property of a population may be reported to users. Since the
aim of this section is only to provide the DBA with the
power to do so, in general three types of constraints will
be distinguished. (Defining these constraints is very much
dependent on the specific environment and we are unable to
give more detailed analysis and structural specifications of
the constraints as done by [Hammer and McLeod, 1975] for
semantic integrity constraints).

1) Security Atom Constraints (SA-constraints) apply to
the SA-value set in an SA-population A and all populations
that contain A. An example is: sum salary information must
not include the salary x of employee a in SA-value set w
until there is another employee hired or fired.

2)  Global  constraints  (Type  1)  apply  to  the  individuals 
in  a  population  A  and  individuals  of  all  or  some  of  the 
populations  in  the  hierarchy  that  contain  A. 

Consider Figure 4.4 and Example 4.1 given in the
introduction. Assume user group u is allowed to access down
to Systems Analyst in the hierarchy. Now the hiring of two
new Systems Analysts with Ms and with total salary $27K
should not be incorporated into the population Systems
Analyst. However, if the range of salaries of Computer
Scientists with Ms includes $14K due to its other child
populations, then the new change may be incorporated into
the populations Computer Scientist with Ms and Computer
Scientist (if other constraints are also satisfied).

3)  Global  constraints  (Type  2)  apply  to  the  individuals 
of  a  population  A  and  to  individuals  of  another  population  B 
in  a  different  part  of  the  hierarchy. 

Example 4.4 Consider the database of employees, projects
and assignments in Figure 4.5. It is known that at least ten
programmers and one systems analyst with more than ten years
working experience are involved in the database development
project. Assume the user u also knows that a project leader
must be a systems analyst with PhD. Now if COUNT queries of
SYSTEMS-ANALYST-WITH-MORE-THAN-10-YEARS-EXPERIENCE and
ASSIGNMENT-IN-DATABASE-PROJECT return 1 and 11,
respectively, and the user u knows a systems analyst x with
more than 10 years experience, then the user u can deduce
that x is the project leader of the database development
project and also has a PhD. To prevent this disclosure, type 2
global constraints applied to
SYSTEMS-ANALYST-WITH-MORE-THAN-10-YEARS-EXPERIENCE and
ASSIGNMENT-IN-DATABASE-PROJECT may state that if the COUNT
information of these two populations is smaller than a1 and
10+a2, respectively, where a1 and a2 are properly chosen
small integer constants, then the COUNT information of both
of the populations is not revealed to users.
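The type 2 global constraint of Example 4.4 can be sketched as a small
guard applied to the two COUNT answers before they are reported. This
is only an illustration: the function name, the use of None for a
withheld answer, and the values chosen for the two constants are
assumptions.

```python
from typing import Optional, Tuple


def guarded_counts(analyst_count: int, assignment_count: int,
                   a1: int, a2: int) -> Tuple[Optional[int], Optional[int]]:
    """Withhold both COUNT answers when both fall below their thresholds.

    analyst_count is the COUNT of systems analysts with more than ten
    years of experience; assignment_count is the COUNT of assignments
    in the database project. None stands for "not revealed".
    """
    if analyst_count < a1 and assignment_count < 10 + a2:
        return None, None
    return analyst_count, assignment_count


# With the hypothetical choice a1 = a2 = 3, the answers 1 and 11 of
# Example 4.4 are both withheld.
example_answer = guarded_counts(1, 11, a1=3, a2=3)
```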
Example 4.5 Consider Figure 4.5. Assume two programmers are
hired. Now if the COUNT information of both PROGRAMMER and
ASSIGNMENT-IN-DATABASE-PROJECT increases by two then it can
be inferred that the new programmers are assigned to the
database development project. If this information is to be
protected then type 2 global constraints applied to child
populations of ASSIGNMENT may state that new assignments in
child populations of ASSIGNMENT are reported only when there
are new assignments in two or more projects.

4.3  A  Statistical  Security  Management  Facility 

The  formal  definition  of  the  SDB  in  Section  4.2.2 
includes  type-1  operations  for  high-level  schema 
modifications  and  SDB  Knowledge  Base  with  its  own  set  of 
operations.  For  the  simplicity  and  the  efficiency  of  the 
design,  the  SDB  design  in  this  section  will  not  include 
type-1  operations,  and  a  very  limited  version  of  the 
Knowledge  Base  will  be  introduced.  However,  an  SDB  design 
that  permits  type-1  operations  and  a  more  general  SDB 
Knowledge  Base  will  be  discussed  in  Chapter  5. 

In  this  section,  a  statistical  security  management 
facility  (SSMF)  with  three  principal  components  is  proposed. 

(1)  A  Population  Definition  Construct  (PDC) 

(2) A User Knowledge Construct (UKC)

(3)  A  Constraint  Enforcer  and  Checker  (CEC). 

The PDC of a population contains information about the
population, related security constraints, changes of the
population, etc., in order to achieve effective protection.
The UKC of a user group is designed to record users'
additional knowledge and SA-constraints. Finally, the CEC
consists of several algorithms designed to keep the PDCs and
UKCs up-to-date, to enforce the security constraints and to
help the DBA in security-related decision problems.

Population  Definition  Construct: 

For  each  population  P,  there  is  one  PDC  which  contains 
the  following  information. 

(a)  Description  of  the  population  and  its  parent,  child 
and  sibling  populations. 

(b) Lowest permissible user group level.

(c)  Information  as  to  how  changes  are  included  in  P. 

(d)  Allowable  statistical  query  types  for  each  property 
of  P. 

(e)  Global  constraints  of  P. 

(f)  If  P  is  an  SA-population  then  description  of  SA- 
constraints  for  each  SA-value  set  of  P. 

(a), (d) and (f) are self-explanatory.

(b) Assume user groups are classified by levels such
that user groups with higher levels have more access power
to the database than the user groups with lower levels.
The lowest permissible user group level is a level n such
that user groups with level m ≥ n can access that population.
Population PROGRAMMER
    description [Phrase],
    parent populations [COMPUTER-SCIENTIST],
    child populations [(PROGRAMMER-WITH-BS, PROGRAMMER-WITH-MS),
        (PROGRAMMER-WITH-0-4-YEARS-EXPERIENCE,
        PROGRAMMER-WITH-5-10-YEARS-EXPERIENCE,
        PROGRAMMER-WITH-11-OR-MORE-YEARS-EXPERIENCE)],
    other populations in the same cluster [SYSTEMS-PROGRAMMER,
        SYSTEMS-ANALYST],
    lowest permissible user group level 2,
    allowable query COUNT,
    changes processed in PAIRS,
    protected property SALARY,
        allowable query SUM,
    protected property ABSENT-DAYS,
        allowable query MEDIAN,
    global constraints
        constraint 1
            description [Phrase],
            call VIOL-CS1,
        constraint 2
            description [Phrase],
            call VIOL-CS2,
end.


Figure 4.7. The PDC of Programmer
(c) Changes due to the dynamics of the real world may
be processed in many ways. How these changes are handled is
described in the PDC.

(e) Global constraints may be static or dynamic. They
may evolve and change as the DBA modifies them, e.g. when a
manager changes companies and thus extends his knowledge,
the DBA should take the necessary actions. For each global
constraint, the PDC contains the description of the
constraint and a call for a routine in the case of violation
of the constraint. Figure 4.7 contains the PDC of Programmer
in the generic hierarchy described in Figure 4.4.

User  Knowledge  Construct: 

For  each  user  group  u,  the  UKC  records  the  users' 
additional  knowledge  about  individuals  in  the  SDB.  Figure 
4.8  contains  the  UKC  of  user  group  u  for  the  generic 
hierarchy  described  in  Figure  4.4. 

Assume  user  group  u  is  at  the  3rd  level  which  can 
access  all  populations  in  the  hierarchy. 

Users  in  user  group  u  can  identify  the  individuals  that 
are  updated,  inserted  or  deleted  from  the  population 
PROGRAMMER  (e.g.  the  newly  inserted,  deleted  or  updated 
programmer  in  the  population  PROGRAMMER  is  known  by  the  user 
group  u).  For  each  population,  this  information  is  defined 
after  the  keyword  "identifiable  dynamics"  in  the  UKC. 
Clearly, protection measures to be applied should be
different for a user group which identifies only inserted
individuals of a population and a user group which
identifies both inserted and deleted individuals of the same
population. (There may be other variations, for example,
users in user group w may identify updated individuals when
the update is from Systems Programmer to Systems Programmer,
etc.)

Each SA-population contains one SA-value set for each
of its properties. Dynamics of an SA-population (i.e.
inserted, deleted, updated individuals) are recorded in a
list called the change sequence in the order of occurrences
of changes. (This list may be kept separately if the
expected number of changes is large). Depending on the type
of statistical information revealed, the change sequence is
used in several procedures to decide whether the security of
individuals and the protected information are in danger.

For security purposes, changes may be processed in
groups, say triplets. In such cases, some individuals may be
waiting to be processed; these individuals are described and
maintained in SA-constraints. For each SA-value set, users
may know global upper or lower bounds of the property values
of individuals, and upper or lower bounds for some specific
individuals. For example, in Figure 4.8, users in user group
u know that the salary of programmer Ian Munroe is less than
$18K.
USER GROUP U [user-id, user-id, user-id],
user group level 3;
population COMPUTER-SCIENTIST,
    identifiable dynamics INSERTION, DELETION, UPDATE,
population PROGRAMMER,
    identifiable dynamics INSERTION, DELETION, UPDATE,
population PROGRAMMER-WITH-0-4-YEARS-EXPERIENCE [SA-POPULATION],
    identifiable dynamics INSERTION, DELETION, UPDATE,
    protected property SALARY,
        security atom constraint: {JOHN DOE} is not included,
        change sequence parameters
            active individuals set {(STEVE HART, ROCK HO, 20), ...,
                (JOHN GRAY, , 3)},
            reachability constant 0.1,
            largest reachability set size 20,
        known value set {(JOHN SO), ..., (ALAN POE)},
        known global upper bound $34K,
        known global lower bound $8K,
        known upper bounds set {(IAN MUNROE, $18K), ...,
            (GEORGE HO, $20K)},
        known lower bounds set {0},
        change sequence {[(JIM JOE, INSERT), (JACK YU, DELETE),
            (OLD MEDIAN, $15K), (NEW MEDIAN, $14K)], ...,
            [(PHILIP HO, DELETE), (JACK FU, DELETE),
            (NEW MEDIAN, $18K)]},
    protected property ABSENT-DAYS,
        security atom constraint: {STEVE HUDSON} is not included,
        change sequence parameters
            active individuals set {(STEVE HART, , 15), ...,
                (JOHN GRAY, , 7)},
            reachability constant 0.2,
            largest reachable set size 15,
        known value set {0},
        known global upper bound 90,
        known global lower bound 0,
        known upper bounds set {0},
        known lower bounds set {0},
        change sequence {[(JIM JOE, INSERT)],
            [(JACK YU, DELETE), (CHEN TU, DELETE),
            (OLD SUM, 450), (NEW SUM, 345)], ...},
population PROGRAMMER-WITH-5-10-YEARS-EXPERIENCE [SA-POPULATION],

end.


Figure 4.8. The UKC of user group u for the generic hierarchy
described in Figure 4.4.
Change sequence parameters are described as
reachability constant $w_1$ and largest reachable set size
$w_2$ (see Section 3.3). For each change sequence, an active
individuals set is also maintained. (A vertex is inactive if
it corresponds to an individual deleted from the population;
it is called active if it corresponds to an individual
previously inserted but not yet deleted from the
population). An active individuals set contains one set
element for each connected component with one or two active
individuals. Each set element contains the names of the
active individuals and the number of vertices in that
connected component (i.e. the reachability set size). For
example, in Figure 4.8, the information graph of the
population of Programmer with 0-4 years of experience
contains a connected component with two active vertices for
individuals Steve Hart and Rock Ho, and the number of
vertices in that connected component is 20.


Constraint  Enforcer  and  Checker: 

The CEC is composed of several algorithms. It utilizes
PDCs and UKCs to perform the following two basic tasks:

(a)  For  each  statistical  query,  it  is  invoked  to  find 
out  the  global  and  SA-constraints  by  tracing  the  related  PDC 
and  the  UKC,  to  enforce  these  constraints  by  executing  the 
related  procedures  (and  thus  altering  the  answer  to  the 
user's  statistical  query,  if  necessary); 
(b) For each change (i.e. insertion, deletion or update
of individuals) in the populations of the D-A Model, it is
invoked to modify the constraints, to decide whether to
process (i.e. to include into users' statistical queries)
the change for each SA-value set, and (if the change is
processed) to modify the related change sequence and its
parameters for each user group u.

In addition to the above, the CEC helps the DBA in several
security-related decision problems by providing lists of
individuals whose security is threatened under the events
described below.

(1) Changes in user groups. User groups may join,
decompose, or users may move from one user group to another.
In these cases, additional knowledge of a user group may
increase, and further disclosed information is then decided
by the CEC using the UKC of the user group and change
sequences.

(2) Changes in the conceptual model such as
decomposing a population or re-partitioning a population. In
these cases, the CEC finds SA- and A-populations,
re-arranges UKCs, modifies security measures and reports
disclosures.

(3) Changes in users' additional knowledge such as a
modified known value set or an updated known upper bounds
set. In these cases, further disclosures are decided by
tracing the change sequences and considering possible
inferences discussed in Section 4.4.

Figure 4.9. Statistical Database Model

Another  job  of  the  CEC  is  to  modify  and  maintain 
several  security  measures  related  to  the  change  sequence  of 
each  SA-value  set  in  order  to  give  the  DBA  a  measure  of  how 
secure  the  system  really  is  at  a  particular  time.  Some  of 
these  measures  are  discussed  in  Section  3.3. 

The  general  scheme  of  the  SSMF  is  depicted  in  Figure 
4.9.  The  CEC  utilizes  the  conceptual  model,  PDCs  and  UKCs  to 
enforce  security  constraints  and  modify  statistical  queries. 
Individual  insertions,  deletions  and  updates  into 
populations  in  the  conceptual  model  are  intercepted,  and 
modifications  of  security  constraints,  change  sequences  and 
their parameters are done by the CEC. In the case of
disclosures, the DBA is notified, and security-related
reports (the values of security measures, the number
of constraints in effect, the number of individuals not
included in statistical queries, the introduced error,
etc.) are produced by the CEC.

4.3.1  Implementation  Considerations 

In  this  section,  implementation  and  maintenance  issues 
of  security-related  structures  are  discussed. 


To answer a statistical query, conventional database
query processing is performed to obtain the answer, and then
related security constraints (i.e. global and
SA-constraints) are enforced. To retrieve SA-constraints, the
user group of the user is determined and the related UKC is
accessed. To retrieve global constraints, the PDC of the
population described in the user's query is accessed. Thus
the additional time overhead over conventional database
query processing involves two accesses and the processing of
security constraint routines.
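The query path described above can be sketched as follows. This is a
minimal illustration, not the CEC itself. The class layout, the
representation of a constraint as a routine that returns the (possibly
altered) answer or None for a withheld answer, and the order in which
the routines are run are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A constraint routine receives the current answer and returns it
# unchanged, returns an altered answer, or returns None to withhold it.
Constraint = Callable[[float], Optional[float]]


@dataclass
class PDC:
    global_constraints: List[Constraint] = field(default_factory=list)


@dataclass
class UKC:
    # SA-constraints of one user group, keyed by population name.
    sa_constraints: Dict[str, List[Constraint]] = field(default_factory=dict)


def answer_query(population: str, raw_answer: float, user_group: str,
                 pdcs: Dict[str, PDC], ukcs: Dict[str, UKC]) -> Optional[float]:
    """Enforce SA- and global constraints on a conventionally computed answer."""
    ukc = ukcs[user_group]   # first extra access: the UKC of the user group
    pdc = pdcs[population]   # second extra access: the PDC of the population
    routines = ukc.sa_constraints.get(population, []) + pdc.global_constraints
    answer: Optional[float] = raw_answer
    for routine in routines:
        if answer is None:   # already withheld, nothing left to alter
            break
        answer = routine(answer)
    return answer
```

The two dictionary lookups correspond to the two additional accesses
mentioned in the text. The remaining cost is that of the constraint
routines themselves.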

To  process  changes  (i.e.  insertion,  deletion  or  updates 
of  individuals),  in  addition  to  the  conventional  database 
query  processing,  the  related  PDC  is  accessed  to  modify 
global  constraints,  and  then,  for  each  user  group,  the  UKC 
is  retrieved  to  update  SA-constraints,  the  user  group's 
knowledge  and  change  sequence  and  its  parameters. 

Below, possible operations on the Known Value Set (KVS),
the Known Upper and Lower Bounds Sets (KUBS and KLBS),
change sequences and their parameters in the UKC are
discussed. The following notation is used for an
SA-population with m individuals and p properties. The i-th
property value of individual j in the population is denoted
by $p_{ij}$. The KUBS of the i-th property for the user group G
is denoted by KUBS(i,G). $v_{ij}$ is the known upper bound
value for the i-th property of individual j in the
SA-population. $v_{ij}$ is either an element of KUBS(i,G) or
equal to the known global upper bound for the i-th property.
The operations on the KVS are set membership check,
insertion and deletion. Clearly, the KVS is a dictionary
[Aho et al., 1974]. Thus a balanced tree (a 2-3 tree or an
AVL tree) may be used to process an operation of the KVS in
$O(\log_2 m)$+ time, where m is the number of individuals in
the population.


The KUBS(i,G) should support the following operations:

a) Search for

1) $v_{ij}$ for a given j,

2) all j such that $(v_{ij} - p_{ij}) \le c$, where c is a
constant,

b) insertion of $v_{ij}$ into KUBS(i,G) for a given i,

c) deletion of $v_{ij}$ from KUBS(i,G) for a given i,

d) update of $v_{ij}$ for a given j.

Operation a-2 may be required to assess how "close" the
users' knowledge of the upper bound value $v_{ij}$ is to
$p_{ij}$. The operations on the KLBS are similar.

For the KUBS(i,G), operation d is equivalent to two
operations, operations b and c. Thus, operations a-1, b, c
and d imply a dictionary, and the balanced tree $T_1$ with
each leaf node j (j is an individual in the population)
containing $v_{ij} \in$ KUBS(i,G) may be used. For operation
a-2, if it is executed rarely then a sequential search is
sufficient; otherwise another balanced tree $T_2$ (a 2-3 tree
or an AVL tree) may be used, whose external node is a linked
list containing individuals j with the same $(v_{ij} - p_{ij})$
values, and each external node is linked to the external
node to its right. Thus, if there are $n_i$ different
$(v_{ij} - p_{ij})$ values in the SA-population, operation a-2
requesting all j such that $(v_{ij} - p_{ij}) \le c$ requires
$O(\log_2 n_i)$ comparisons for determining the external node
e containing individuals j with $(v_{ij} - p_{ij}) = c$, and
then sequential traversal of the external nodes to the left
of the external node e is sufficient to complete operation
a-2. If each external node of $T_1$ containing
$v_{ij} \in$ KUBS(i,G) also contains a pointer to the
individual j's corresponding node in $T_2$ then maintenance
of $T_2$ due to operations b, c and d takes $O(\log_2 n_i)$,
$O(1)$ and $O(\log_2 n_i)$ time, respectively. Thus
operations a-1, a-2, b, c and d take $O(\log_2 m)$ time.

+O-notation is described in [Aho et al., 1974].
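The operations on KUBS(i,G) can be illustrated with the sketch below.
It is not the two-tree structure described above: the Python standard
library has no balanced tree, so a hash dictionary stands in for the
dictionary on individuals and a sorted list maintained with bisect
stands in for the structure ordered by $(v_{ij} - p_{ij})$. Insertion
and deletion in the sorted list therefore cost time linear in its
length rather than logarithmic. The class name, method names and the
sample salary values are assumptions.

```python
import bisect
from itertools import takewhile
from typing import Dict, List, Optional, Tuple


class KnownUpperBounds:
    """KUBS(i, G) for a single property i of one SA-population."""

    def __init__(self, true_values: Dict[str, float]) -> None:
        self.p = dict(true_values)                # p_ij for each individual j
        self.v: Dict[str, float] = {}             # known upper bounds v_ij
        self.gaps: List[Tuple[float, str]] = []   # sorted (v_ij - p_ij, j)

    def search(self, j: str) -> Optional[float]:          # operation a-1
        return self.v.get(j)

    def insert(self, j: str, bound: float) -> None:       # operation b
        if j in self.v:
            self.delete(j)
        self.v[j] = bound
        bisect.insort(self.gaps, (bound - self.p[j], j))

    def delete(self, j: str) -> None:                     # operation c
        bound = self.v.pop(j)
        index = bisect.bisect_left(self.gaps, (bound - self.p[j], j))
        del self.gaps[index]

    def update(self, j: str, bound: float) -> None:       # operation d = c then b
        self.delete(j)
        self.insert(j, bound)

    def within(self, c: float) -> List[str]:              # operation a-2
        # Sequential traversal from the smallest gap up to c.
        return [j for _, j in takewhile(lambda item: item[0] <= c, self.gaps)]


# Hypothetical true salaries (in $K) and the bounds known to the user group.
kubs = KnownUpperBounds({"IAN MUNROE": 16, "GEORGE HO": 12})
kubs.insert("IAN MUNROE", 18)
kubs.insert("GEORGE HO", 20)
close_to_truth = kubs.within(2)
```

Here within(2) reports the individuals whose known upper bound lies
within $2K of the true salary, which is the kind of "closeness" check
operation a-2 is meant for.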


The change sequence is consulted during checking for
disclosure, and modified during a change in the population.
The operations on the change sequence are:

1) inserting a new change into the change sequence and
modifying the change sequence parameters,

2) finding a certain reachability set from the change
sequence,

3) finding all reachability sets from the change
sequence.


The change sequence may be implemented as a list. Including a new change into the change sequence can then be achieved by simply inserting a node at the front of the list. To modify the active individuals set, a search in the active individuals set is needed for each deleted individual. If there are few active individuals then a linear search is sufficient. For a large number of active individuals, this search may be made faster by implementing a dictionary, e.g. by keeping the active individuals in the leaves of a balanced tree. Each active individual in the tree contains a pointer to the other active individual in the same connected component of the information graph. Thus maintenance of the active individuals set can be performed in $O(\log_2 k)$ time, where k is the number of individuals in the active individuals set. The reachability constant $w$ can be updated without extra time complexity if each active individual in the leaves of the tree also contains the name of its reachability set. For the largest reachable set size, maintaining the name of the largest reachable set and keeping in each leaf node of the tree the number of individuals in the related reachability set is sufficient to maintain it without any extra time complexity. Thus operation 1 can be processed in $O(\log_2 k)$ time, where k is the size of the active individuals set.

All reachability sets of an SA-population may be found by a single sequential backward traversal of the change sequence as follows:

(i) Create one reachability set $Rs_i$ for each element in the active individuals set. Also, for each active individuals set element having a single active individual, $r_k$, or two active individuals, $r_k$ and $r_t$, create an element for a set A (initially empty) as $(r_k, \ , Rs_i)$ or $(r_k, r_t, Rs_j)$ respectively, where $Rs_i$ and $Rs_j$ are the related reachability sets.

(ii) Traverse the change sequence backwards and, for each change sequence tuple $(r_k, r_t)$, perform a, b and c below; then stop.

(a) If neither $r_k$ nor $r_t$ is in any 3-tuple x of A, create a new reachability set $Rs_j$ with individuals $r_k$, $r_t$, and create the set element $(r_k, r_t, Rs_j)$ for A.

(b) If only $r_k$ ($r_t$) is in a 3-tuple x of A, then replace $r_k$ ($r_t$) with $r_t$ ($r_k$) in x. Add $r_t$ ($r_k$) to the reachability set specified in the 3-tuple x.

(c) If both $r_k$ and $r_t$ are in a 3-tuple x of A, then remove x from A.

If each external node in the active-individuals tree is linked to the external node to its right, then step (i) can be performed by scanning the active individuals set in $O(k)$ time, where k is the number of individuals in the active individuals set. Set A can be implemented as a dictionary. For example, a balanced tree S may be used, each leaf node of which contains an individual in a 3-tuple of A, the related reachability set and a pointer to the other individual in the 3-tuple. Then step (ii) takes $O(t \log_2 s)$ time, where t is the number of tuples in the change sequence and s is the maximum size of A at any time. Thus operation 3 takes $O(k + t \log_2 s)$ time.
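
The backward traversal of steps (i) and (ii) can be sketched as follows. This is only an illustration under stated assumptions: the function name `reachability_sets`, the representation of each active-set element as a tuple of one or two individuals, and the use of a Python dict in place of the balanced tree S (expected constant-time lookups instead of $O(\log_2 s)$) are choices made here and are not part of the design above. The case in which the two individuals of a change tuple lie in different 3-tuples is not covered by steps (a)-(c), so the sketch rejects it.

```python
def reachability_sets(active_elements, change_sequence):
    """Sketch of operation 3 (hypothetical interface).

    active_elements: iterable of tuples holding one or two active individuals.
    change_sequence: list of (r_k, r_t) tuples, oldest change first.
    Returns the list of reachability sets, in order of creation.
    """
    sets = []        # every reachability set created so far
    tracked = {}     # individual -> record playing the role of a 3-tuple of A

    # Step (i): one reachability set and one element of A per active-set element.
    for element in active_elements:
        rs = set(element)
        sets.append(rs)
        record = {"ends": set(element), "rs": rs}
        for individual in element:
            tracked[individual] = record

    # Step (ii): traverse the change sequence backwards.
    for r_k, r_t in reversed(change_sequence):
        rec_k, rec_t = tracked.get(r_k), tracked.get(r_t)
        if rec_k is None and rec_t is None:
            # (a) neither individual is tracked: start a new reachability set.
            rs = {r_k, r_t}
            sets.append(rs)
            record = {"ends": {r_k, r_t}, "rs": rs}
            tracked[r_k] = tracked[r_t] = record
        elif rec_k is not None and rec_t is not None:
            if rec_k is not rec_t:
                raise ValueError("case not covered by steps (a)-(c)")
            # (c) both individuals are in the same 3-tuple: remove it from A.
            tracked.pop(r_k, None)
            tracked.pop(r_t, None)
        else:
            # (b) exactly one individual is tracked: swap it for the other one
            # and add the other one to the related reachability set.
            old, new = (r_k, r_t) if rec_k is not None else (r_t, r_k)
            record = tracked.pop(old)
            record["ends"].discard(old)
            record["ends"].add(new)
            record["rs"].add(new)
            tracked[new] = record
    return sets
```

For example, with active-set elements `("a",)` and `("b", "c")` and the change sequence `[("x", "y"), ("d", "b"), ("a", "e")]`, the sketch extends the first set with `e`, the second with `d`, and creates a third set for `x` and `y`.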




Another implementation issue is fast access to all A-populations of a given population, which is needed to enforce SA-constraints. This problem requires a traversal in the related UKC. Note that UKCs are expected to be very large data structures. In the D-A model, this problem takes the form of hierarchical traversal. Since the UKC is very large, one may assume that it resides in secondary memory. Each user query triggers a traversal in the UKC, and in a paging environment, there are page faults. The aim is to minimize these faults.


It may be assumed that the population specified by the query is accessed using a table or, for example, using base displacement. Thus it is sufficient to consider only the problem of finding its A-populations. Clearly, given a population, there are several paths to access its A-populations since a node in the UKC may have several clusters. Thus the access paths in the UKC which require the minimum number of node visits must be determined first. This can be performed in $O(n)$ time, where n is the number of nodes in the hierarchy, by a simple algorithm as follows: (a) start from level k, k = (highest level in the hierarchy) - 1; (b) for each node at level k, find its cluster to be used in the access path and record the "number of edges to be traversed"; (c) decrease k by 1. If k > 1, go to (b); otherwise stop.

Assuming the frequency of access of each node $v_i$ in the hierarchy is $R(v_i)$ and P is the page size, the problem now becomes optimum partitioning of the graph formed by access paths. This problem is known to be NP-complete [Garey and Johnson, 1979]. There are however some heuristics which require $O(n^2)$ or $O(n^2 \log_2 n)$ time [Kernighan and Lin, 1970]. One approach is to change the access path graph into a tree by an approximation algorithm in $O(n)$ time, and then find the optimum partitioning of the tree using a dynamic programming approach in $O(nP^2)$ time [Lukes, 1974].

4.4  Protection  Requirements  for  Different  Statistical 
Queries 

In  this  section  possible  security  constraints  for 
different  statistical  query  types  are  investigated.  First, 
inferences  available  to  users  are  identified,  then  related 
security  constraints  to  enforce  security  are  briefly 
described.  One  distinguishes  three  different  types  of 
inferences  by  users . 

(a)  Type  S  inferences  due  to  the  hierarchical  structure 
of  the  conceptual  model . 

(b)  Type  D  inferences  due  to  the  dynamics  of  the  real 
world . 

(c)  Type  R  inferences  due  to  existing  relationships 
between  individuals  in  different  populations  or  in  the  same 
population. 


Disclosures in Examples 4.1, 4.4 and 4.5 are due to
type R inferences. Type R inferences are dependent on the
specific  environment.  In  this  chapter,  it  is  assumed  that 
global  constraints  are  defined  by  the  DBA  to  prevent 
disclosures  due  to  type  R  inferences.  That  is,  the  DBA  is 
responsible  for  identifying  type  R  inferences  and  applying 
protection  measures  to  prevent  compromise.  Another  approach 
may  be  to  define  formally  type  R  inferences  and  use  a 
theorem  prover  to  decide  about  the  inferred  knowledge  and 
the  disclosed  information.  In  Chapter  5,  this  approach  is 
outlined,  which  uses  a  Question-Answering  System  to  enhance 
the  security  of  the  SDB . 

Below  only  type  S  and  type  D  inferences  are  considered. 
The  following  are  suggested  schemes  for  a  sample  of 
different  types  of  statistical  queries;  the  others  can  be 
derived  similarly. 

4.4.1  COUNT  Queries 

Assume only COUNT queries are allowed and individuals in populations are identifiable. Assume Systems Analyst with Bs is decomposed into two subpopulations, "Systems Analyst with Bs and convicted of felony" and "Systems Analyst with Bs and not convicted of felony". It is well known [Hoffman and Miller, 1970] that

COUNT(Systems Analyst with Bs) = COUNT(Systems Analyst with Bs and convicted of felony)
John Doe is a Systems Analyst with Bs
$\rightarrow$† John Doe is convicted of felony.

Similarly,

COUNT(Systems Analyst with Bs and not convicted of felony) = 0
John Doe is a Systems Analyst with Bs
$\rightarrow$ John Doe is convicted of felony.


Thus for type S inferences we need the following global constraints.

(a) in population SYST-ANALYST-BS-CONV-FELONY,

If COUNT(any superpopulation of SYST-ANALYST-BS-CONV-FELONY) - COUNT(SYST-ANALYST-BS-CONV-FELONY) $\le$ a
then individuals creating the above difference are not reported in COUNT queries.

(b) in population SYST-ANALYST-BS-NOT-CONV-FELONY,

If COUNT(SYST-ANALYST-BS-NOT-CONV-FELONY) $\le$ a
then COUNT(SYST-ANALYST-BS-NOT-CONV-FELONY) is not answered.
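
The type S inference and constraint (b) can be illustrated with a small sketch. Everything named here is hypothetical (the individuals, the helper names `type_s_disclosure` and `count_query`), and the comparison is assumed to be "less than or equal to a", since the comparison symbol in the constraints is not legible in every copy of the text. Constraint (a) is not modelled.

```python
def type_s_disclosure(parent, child):
    """True when COUNT(parent) equals COUNT(child) for a non-empty child
    subpopulation, so every identifiable member of parent is known to be
    in child as well."""
    return len(child) > 0 and len(parent) == len(child)


def count_query(population, a):
    """Answer COUNT(population), or return None when the count is at most
    the threshold a (the assumed reading of constraint (b))."""
    n = len(population)
    return None if n <= a else n


# Hypothetical data: every Systems Analyst with Bs happens to be convicted.
analysts_bs = {"John Doe", "Ann Roe", "Bob Poe"}
convicted = {"John Doe", "Ann Roe", "Bob Poe", "Sam Low"}
bs_convicted = analysts_bs & convicted
bs_not_convicted = analysts_bs - convicted
```

With this data the two counts coincide, so knowing that John Doe is a Systems Analyst with Bs discloses his conviction; refusing COUNT for the empty "not convicted" subpopulation hides the second form of the same inference.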


The above disclosure type and constraint have been proposed
and discussed widely in recent studies [Schlorer, 1975;
Schlorer, 1976; Denning et al., 1979]. If introduced in a
controlled environment, it is a viable protection procedure.

The  constraints  described  above  can  be  easily  modified 
to  consider  the  users'  knowledge  described  in  the  UKC  by 
changing  a  to  (a+x)  where  x  is  the  size  of  the  user  group's 
knowledge  in  the  population. 

Type D inferences may be avoided likewise, e.g. only when there are $(a_1 + x)$ insertions or $(a_2 + x)$ deletions from either of the populations SYST-ANALYST-BS-CONV-FELONY or SYST-ANALYST-BS-NOT-CONV-FELONY are changes reported to user group u, where insertion and deletion of SYST-ANALYST-BS are identifiable.

†$\rightarrow$ means "implies" or "imply".

4.4.2  SUM  and  COUNT  Queries 

Consider  Figure  4.4  and  assume  only  SUM  and  COUNT 
queries  are  allowed  for  all  populations.  Assume  further  that 
users  in  user  group  u  can  always  identify  if  a  programmer  is 
hired  or  fired  but  cannot  identify  if  he  is  a  systems 
programmer  or  a  systems  analyst.  Now  any  changes  in 
populations  Systems  Analyst  or  Systems  Programmer  can  be 
reported  to  user  group  u  immediately  but  care  must  be  taken 
in  reporting  changes  in  the  population  Programmer.  In  what 
follows  it  is  assumed  that  insertions  and  deletions  into  a 
population  are  identifiable  and  only  type  D  inferences  are 
considered.  Later  type  S  inferences  are  also  discussed. 

Using  the  information  graph,  it  is  shown  in  Chapter  3 
that  there  are  several  ways  of  securing  the  SDB  for  SUM  and 
COUNT  queries.  Clearly,  these  results  should  incorporate 
users'  knowledge  in  order  to  be  meaningful.  If  individuals 
with  known  (or  suspected)  property  values  are  processed  in 
pairs  only  with  known  individuals  then  the  SDB  will  still  be 
secure,  since  unknown  values  will  be  separated  from  known 
values in the equations in pairs. Thus in order to prevent
disclosures due to type D inferences, we have

(a) populations with an even number of individuals, and

(b) for each SA-value set in each UKC, an SA-constraint
which delays the processing of (known and/or unknown)
individuals with recent changes until another change occurs
(see the SA-constraint in Figure 4.9).

Clearly,  if  each  population  has  an  even  number  of 
individuals  then  the  size  difference  between  a  population 
and  any  of  its  parent  populations  must  be  either  zero  or  at 
least  two.  Thus  type  S  inferences  cannot  cause  any 
disclosure. 
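
Why changes are delayed and processed in pairs can be seen from a small differencing sketch. The salary figures and the helper name `sum_and_count` are made up for illustration; the point is only that one identifiable insertion lets a user subtract two SUM answers and obtain the new value exactly, whereas a pair of unknown insertions reveals only the sum of the two values.

```python
def sum_and_count(values):
    """Return the answers to SUM and COUNT for a list of property values."""
    return sum(values), len(values)


before = [52000, 61000, 58000, 47000]   # hypothetical salaries, even population
s0, n0 = sum_and_count(before)

# One identifiable insertion: the difference of the two SUM answers
# is exactly the new individual's salary.
s1, n1 = sum_and_count(before + [75000])
single_leak = s1 - s0

# Two insertions processed together: only x + y can be derived,
# and the population size stays even.
x, y = 75000, 43000
s2, n2 = sum_and_count(before + [x, y])
pair_leak = s2 - s0
```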

A population A may contain several SA-populations and
thus several individuals may be waiting for inclusion in
statistical answers of A. To prevent that, when t unknown
(known) individuals are waiting they may be included in
the statistical answers of A. This can be specified by
global  constraints  in  A  and  its  parent  populations.  The 
bound  t  must  be  large  enough  so  that  the  information 
revealed  to  users  will  be  practically  useless  for  all 
disclosure  purposes. 

4.4.3  MEDIAN  and  COUNT  Queries 

Assume  only  MEDIAN  and  COUNT  queries  are  allowable  for 
all  populations  in  the  conceptual  data  model.  Both  type  S 
and type D inferences may lead users to obtain upper or
lower bounds for protected property values if insertions,
deletions, or updates are identifiable. The following
examples illustrate the possible inferences.

Example  4.6  (type  S  inference) 

Consider  Figure  4.4.  Assume  individuals  are 
identifiable  and  there  is  only  one  Programmer  with  Ms  whose 
salary  is  x.  Thus  queries  about  Programmer  with  Ms  are  not 
permitted.  However,  the  following  information  is  obtainable. 

MEDIAN(PROGRAMMER-BS, SALARY) = $a$    MEDIAN(PROGRAMMER, SALARY) = $a_1$
COUNT(PROGRAMMER-BS) = $n$    COUNT(PROGRAMMER) = $n+1$

Now, $a_1 > a \rightarrow x > a$ and $a_1 < a \rightarrow x < a$.

Thus  one  has  an  inference  about  the  salary  x  of  the  only 
programmer  with  Ms. 

Example  4.7  (type  D  inference) 

Consider Figure 4.4 and assume changes are identifiable.

MEDIAN(SYSTEMS-ANALYST-WITH-BS, SALARY) = $a$
COUNT(SYSTEMS-ANALYST-WITH-BS) = $m$

Assume a new Systems Analyst with Bs and salary x is hired:

MEDIAN(SYSTEMS-ANALYST-WITH-BS, SALARY) = $a_1$
COUNT(SYSTEMS-ANALYST-WITH-BS) = $m+1$

Now $a_1 > a \rightarrow x > a$ and $a_1 < a \rightarrow x < a$.
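
The single-insertion inference of Example 4.7 can be reproduced with the standard `statistics.median` function. The helper name `insertion_inference` and the salary values are hypothetical; the sketch only reports which bound on x a user could derive from the two MEDIAN answers.

```python
from statistics import median


def insertion_inference(old_values, x):
    """Return the bound on x that follows from the medians before and
    after x is inserted into old_values."""
    a = median(old_values)
    a1 = median(old_values + [x])
    if a1 > a:
        return "x > a"
    if a1 < a:
        return "x < a"
    return "no bound"


salaries = [40, 50, 60]   # hypothetical values, median a = 50
```

Calling `insertion_inference(salaries, 70)` moves the median upward and yields the lower bound, while inserting 30 yields the upper bound; inserting a value equal to the old median leaves the median unchanged.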


We would like to avoid the above inferences.

Processing Changes in Pairs:

First type D inferences will be considered, then type S inferences will be discussed. Consider population A with protected property B and assume all changes are processed in pairs.

MEDIAN(A, B) = $a$    COUNT(A) = $n$

Two individuals with property B values x and y are added to A, then

MEDIAN(A, B) = $a_1$    COUNT(A) = $n+2$

Now (a) n is odd

$a_1 < a \rightarrow x, y < a_1$
$a_1 > a \rightarrow x, y > a_1$

(b) n is even

$a_1 < a \rightarrow$ at least one of $x, y < a_1$
$a_1 > a \rightarrow$ at least one of $x, y > a_1$

Clearly, (b) is better than (a) in the sense that it does
not allow an upper or lower bound inference for any of the
property values x, y.
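
The difference between cases (a) and (b) can be checked numerically. The sketch below is an illustration only: the helper names and the value ranges are assumptions, and for odd n it tests the weak form of the bound (x, y at most, or at least, the new median), since a newly added value can itself become the median. For even n it gives one example in which the median drops although one of the two added values lies above the new median.

```python
import random
from statistics import median


def medians_after_pair(values, x, y):
    """Return (a, a1): the median before and after adding x and y."""
    return median(values), median(values + [x, y])


def check_odd_case(trials=2000, seed=0):
    """Randomly test case (a): for odd n, a lower (higher) new median
    bounds both added values from above (below)."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.choice([3, 5, 7, 9])
        values = [rng.randint(0, 100) for _ in range(n)]
        x, y = rng.randint(0, 100), rng.randint(0, 100)
        a, a1 = medians_after_pair(values, x, y)
        if a1 < a and not (x <= a1 and y <= a1):
            return False
        if a1 > a and not (x >= a1 and y >= a1):
            return False
    return True


# Case (b), n even: the median drops, yet y = 6 is above the new median,
# so no common bound on both x and y follows.
even_example = medians_after_pair([1, 5, 9, 13], 2, 6)
```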

Consider now

MEDIAN(A, B) = $a$    COUNT(A) = $n$

An individual with property x is added to A and an individual with property y is deleted from A, after which

MEDIAN(A, B) = $a_1$    COUNT(A) = $n$

and we have the following inferences

(a) n is odd

$a_1 < a \rightarrow (x < a_1)$ & $(y > a)$
$a_1 > a \rightarrow (x > a_1)$ & $(y < a)$

(b) n is even

$a_1 < a \rightarrow x < y$
$a_1 > a \rightarrow x > y$

Thus  if  every  population  always  has  an  even  number  of 
individuals  then  processing  changes  in  pairs  prevents  any 
direct  inference  of  the  property  values  of  individuals.  One 
can  also  use  this  result  for  type  S  inferences  by  having  the 
global  constraint  that  the  difference  in  size  between  any 
population  and  its  parent  population  should  at  least  be  two. 
This  requirement  actually  is  always  satisfied  if  all 
populations  start  with  an  even  number  of  individuals. 

Processing Changes in Triplets:

Having populations with an even number of individuals and processing changes in pairs still has the following deficiency. Let the median value a be the average of two protected property values u and v, u<v, and let two individuals with protected property values x and y be added into the population; then

$x, y \notin [u,v]$ & $(a_1 < a) \rightarrow x, y < a_1$

and $x, y \notin [u,v]$ & $(a_1 > a) \rightarrow x, y > a_1$.

Thus for large population size n and $a_1 < a$ it is highly probable that $x, y < a_1$; similarly, for large n and $a_1 > a$, we have $x, y > a_1$ with high probability. If this deficiency and the requirement of even population size are not tolerable then changes are processed in triplets. The following inferences are possible.

Assume MEDIAN(A, B) = $a$ and COUNT(A) = $n$.

(a) Individuals with property values x, y, z are added to population A.

MEDIAN(A, B) = $a_1$    COUNT(A) = $n+3$

For even or odd n,

$a_1 < a \rightarrow$ at least two of $x, y, z < a_1$
$a_1 > a \rightarrow$ at least two of $x, y, z > a_1$

(b) An individual with property value x is deleted from A and individuals with property values y, z are added to A.

MEDIAN(A, B) = $a_1$    COUNT(A) = $n+1$

For even or odd n,

$a_1 < a \rightarrow$ at least one of $y, z < a_1$
$a_1 > a \rightarrow$ at least one of $y, z > a_1$

Other changes (i.e. one addition and two deletions or three
deletions) result in similar inferences.

Similarly,  to  prevent  disclosures  due  to  type  S 
inferences  one  can  have  a  global  constraint  that  the  size 
difference  between  any  population  and  its  parent  population 
should  at  least  be  three  or  the  individuals  creating  this 
difference  are  not  reported.  Also  hierarchical  structure  of 
the  conceptual  model  should  be  taken  into  account  for  type  D 
inferences. Consider Figure 4.4: assume three Systems Analysts with Bs and with salaries x, y, z are hired and the median salaries of populations Systems Analyst with Bs, Systems Analyst and Computer Scientist have changed from $a_1, a_2, a_3$ to $b_1, b_2, b_3$ where $b_1 < a_1$, $b_2 < a_2$, $b_3 < a_3$. Now if the DBA learns that user group u had in fact known the value of $z > \max\{a_1, a_2, a_3\}$ then $x, y < \min\{b_1, b_2, b_3\}$ is revealed. Thus this inference should be recorded into the UKC of user group u.


Changes  (whether  processed  in  pairs  or  triplets)  should 
be  recorded  into  the  related  change  sequence  for  auditing 
and  for  other  tasks  of  the  CEC  described  in  Section  4.3.3. 
Also  individuals  with  known  (or  suspected)  property  values 
should  be  processed  only  with  known  individuals  to  avoid 
direct inferences about protected property values.


CHAPTER  5 


EXTENSIONS  FOR  A  SECURE  SDB  DESIGN 

Extensions to the protection scheme of employing
constraints in the conceptual data model are discussed in
order to increase the effectiveness of the protection scheme
and the richness of the SDB. It is argued that a
Question-Answering  System  with  deductive  inference 
mechanisms  may  be  very  useful  for  the  database  system  in 
deciding  inferred  knowledge.  A  set  of  security-related 
commands  for  controlled  changes  about  individuals  and 
populations  in  the  conceptual  data  model  are  proposed.  The 
benefits  of  a  security  kernel  architecture  for  the  SDB  are 
discussed.  Finally,  a  complete  SDB  design  is  discussed, 
which  includes  protection  data,  Question-Answering  System, 
security-related  commands  and  a  security  kernel. 

5.1  Introduction 

This  section  discusses  possible  extensions  to  the  SDB 
design  in  Chapter  4. 

5.1.1  Inferred  Knowledge  and  a  Question-Answering  System 

In Chapter 4, inferences due to existing relationships
in the real world environment (i.e. type R inferences) are
controlled  in  an  ad  hoc  manner.  That  is,  the  DBA  is 
responsible  for  identifying  these  general  rules  and  applying 
security  measures  to  prevent  compromise.  Moreover  the  SDB 
design  in  Chapter  4  does  not  maintain  users'  knowledge  of 
general  rules.  The  UKC  contains  a  part  of  users'  knowledge, 
namely,  the  protected  property  values  of  individuals. 
Although  the  SDB  Knowledge  Set  defined  in  Section  4.2.2 
contains  users'  knowledge  relevant  to  the  application,  the 
UKC  does  not  contain  users'  knowledge  of 

(a)  explicit  facts  that  are  not  represented  in  the 
conceptual  model  which  may  be  utilized  for  disclosure,  e.g. 
"individual  a  is  the  brother  of  individual  b",  and 

(b) general rules that may lead to disclosure, e.g.

(1) Income of a and b is x.
(2) b is not working and has no extra income.
(3) b is the ex-wife of a.
(4) Every nonworking ex-wife gets w% of the income of ex-husband.

$\rightarrow$ Income of a = 100x/(100+w) and Income of b = wx/(100+w)
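
The deduction in this example can be written out as a short worked step. The symbols $I_a$ and $I_b$ for the two incomes are introduced here only for illustration, and rule (4) is read as saying that b's income is w% of a's income.

```latex
\begin{aligned}
I_a + I_b &= x, \qquad I_b = \tfrac{w}{100}\, I_a \\
I_a\left(1 + \tfrac{w}{100}\right) &= x
  \;\Longrightarrow\; I_a = \frac{100x}{100+w}, \\
I_b &= x - I_a = \frac{wx}{100+w}.
\end{aligned}
```

A user who knows only the joint income x and facts (2)-(4) can therefore compute both individual incomes.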

Global  security  constraints  are  proposed  in  Chapter  4  to 
avoid  this  kind  of  compromise.  However,  facts  (2)  and  (3) 
may  not  exist  in  the  conceptual  model  (they  are  perhaps  only 
needed  for  security  purposes),  and  general  rule  (4)  may  (or 
may  not)  be  known  to  some  users.  Clearly,  a  static  global 
constraint  may  fall  short  for  security  depending  on 
particular users' additional knowledge. For example, a
certain user equipped with other explicit facts and general
rules  may  be  able  to  deduce  more.  A  better  approach  may  be 
the  formal  treatment  of  the  question  "what  can  be  inferred 
with  certain  knowledge?".  (It  is  assumed  that  the  database 
system  is  correctly  informed  of  each  user's  knowledge). 

In  order  to  maintain  users'  additional  knowledge 
relevant  to  the  security  of  the  SDB,  and  make  decisions 
about  the  implicit  deducible  information,  the  following 
extensions  are  now  proposed. 

(1) As described in Section 4.2.2, a Knowledge Set for each user group, containing security-relevant general rules and explicit facts that are known by the users and are not described in the conceptual model, and

(2) a Question-Answering System (QAS) with a deductive capability.

Notice that the Knowledge Sets and the QAS will only be used by the DBA or the security-related procedures. Advantages of adding the QAS with deductive capability include the following:

(a)  The  QAS  may  help  the  DBA  in  the  problems 
concerning  security-related  decisions  by  answering 
conditional  questions.  For  example,  "Is  the  salary  of 
dataperson  A  deducible  by  user  group  u  if  A  is  assigned  to 
project p?" may be answered by the QAS as "yes, if he is assigned as a manager or if he is an engineer with a Ph.D.". The QAS can generate answers for the DBA to those questions asking whether compromise may occur (or has occurred) during events such as changes in the database (say, insertion of individuals), changes in the conceptual model (say, new populations) or changes in the knowledge set of users (say, learning a new general rule). If some of the actions to these events are pre-specified (such as enforcing a certain constraint), then these events may be automatically processed by security-related procedures without consulting the DBA.

(b)  The  QAS  may  return  the  reasoning  for  its  answers 
about  the  security  of  information  in  the  SDB .  This  helps  the 
DBA  decide  how  and  where  to  apply  security  constraints 
effectively  without  unnecessarily  reducing  the  richness  of 
the  database.  Also,  even  in  cases  when  the  QAS  is  unable  to 
give  a  definite  answer,  its  line  of  reasoning  may  help  the 
DBA  to  decide  about  the  answer. 

Since type S and type D inferences may be pre-investigated without using the QAS, analysis and prevention of type S and type D inferences can be dealt with separately so as to increase the efficiency of the protection, and the QAS may be used only when most needed.

Several QASs have been reported in the literature [Green, 1969; Minker et al., 1973]. Mostly, the resolution technique [Robinson, 1965] with several improved search strategies [Chang and Lee, 1973; Nilsson, 1971] has been used to derive answers. However, lately, Natural-Deduction systems [Bledsoe and Bruell, 1974] have been offered as alternatives to resolution.


The idea of incorporating a deductive capability into a database system is not new. Kellogg et al. [Kellogg et al., 1976] report the implementation of a relational database with a deductive processor and a general rules file. Minker [Minker, 1978] proposes an inferential system in which there are explicit facts (i.e. extensional database) and general rules (i.e. intensional database). The general rules and explicit facts are used to derive implicit facts within the system. For relational databases with large sets of explicit facts and relatively small sets of inference rules, Reiter [Reiter, 1978] proposes a theorem prover which only looks at the intensional database. The theorem prover's output, which is a set of queries, is then extensionally evaluated.


For  the  SDB  security,  the  main  concern  is  to  decide 
whether  or  not  a  property  value  of  a  single  individual  is 
compromised.  The  deductive  search  for  an  answer  to  this 
question  is  likely  to  deal  with  very  few  explicit  facts. 
Moreover,  the  number  of  general  rules  relevant  to  the 
deductive  process  of  answering  a  particular  security 
question is usually small. Aiding the deductive search with semantic information may be very useful in increasing the efficiency of the QAS. To state explicit facts and general rules, first-order predicate calculus seems to be generally sufficient. In those cases where higher order general rules in the real world exist, the direct intervention and analysis of the DBA are assumed. Below, the input and the output of the QAS are briefly described.

Using the definition of SDB security in Section 4.2.2, the database system is expected to protect confidential facts and some nonconfidential facts. This protection can be reduced to the protection of property values of individuals in the SDB. The fact that individual a has the property value s can be represented by the predicate P(a,s). When the DBA wants to check whether user u can disclose the property value s of individual a, he asks the question Q: $\exists x\, P(a,x)$† together with a pointer to the Knowledge Set of user u. In other words, the DBA wants to prove that the well-formed formula $\exists x\, P(a,x)$ is valid. If Q is valid then, as the answer, "yes" and a term†† are returned. If Q is not valid then there are two possibilities: either the inference procedures run out of time, space or some other criteria, in which case they return "cannot be answered due to search limitations", or the inference procedures come to a halt in searching relevant inferences, in which case the answer returned is "no due to insufficient information". The latter answer simply says that the system cannot infer that there exists x with P(a,x) using the "given" Knowledge Set of the user u and hence, nor can the user u. Since the DBA is only interested in whether or not user u can infer P(a,s), for our purposes, a "no due to insufficient information" answer implies the security of the protected value s. If the QAS returns "cannot be answered due to search limitations", either the DBA is notified or the search is re-tried with different parameters.

†$\exists x$ means "there exists x".

††The answer can be (i) a term involving only known (non-Skolem) constants, functions, and variables, or (ii) a term involving a Skolem constant or function. In case (i), u can compromise the value s. In case (ii), the x in question is known to exist, but because of the involvement of Skolem constants or functions, its value remains unknown. If the user u's knowledge of the existence of s is not desirable by the system then the answer "yes" will protect this information. Otherwise, the answer "yes" may be replaced with "no, insufficient information due to a Skolem (cont'd)
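
The three possible outcomes of a QAS question can be sketched with a very small deductive procedure. This is not the resolution or natural-deduction machinery cited above: it is plain forward chaining over ground rules, chosen here only because it fits in a few lines. The function name `ask`, the tuple encoding of facts, the `max_steps` budget standing in for "search limitations", and the sample Knowledge Set are all assumptions made for illustration.

```python
def ask(individual, facts, rules, max_steps=1000):
    """Ask whether some x with P(individual, x) follows from a Knowledge Set.

    facts: set of ground facts encoded as tuples, e.g. ("P", "b", 300).
    rules: list of (premises, conclusion) pairs; premises is a frozenset
           of ground facts and conclusion is a single ground fact.
    Returns (answer, term).
    """
    known = set(facts)
    steps = 0
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            steps += 1
            if steps > max_steps:
                # The search budget is exhausted before a fixed point is reached.
                return ("cannot be answered due to search limitations", None)
            if conclusion not in known and premises.issubset(known):
                known.add(conclusion)
                changed = True
    # Fixed point reached: look for a derived property value of the individual.
    for fact in known:
        if fact[0] == "P" and fact[1] == individual:
            return ("yes", fact[2])
    return ("no due to insufficient information", None)


# Hypothetical Knowledge Set of a user group.
facts = {("ex_wife", "b", "a"), ("not_working", "b")}
rules = [(frozenset(facts), ("P", "b", 300))]
```

Here `ask("b", facts, rules)` derives a value for b, `ask("a", facts, rules)` halts without one, and a zero step budget forces the "search limitations" answer.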

5.1.2  Changes  and  Adaptability  of  the  Conceptual  Model 

It is generally agreed [Weber, 1976; Winograd, 1973] that a static representation of knowledge is insufficient to model the real world correctly. Moreover, completeness of a conceptual model is dependent on (a) the ability of the designers who create the model and (b) the purpose the model serves [Weber, 1976]. Therefore every aspect of the conceptual data model should have dynamic capabilities, i.e. populations may be formed, deleted or decomposed; individuals may be inserted, deleted or (their properties) updated; relationships may be added or deleted from the conceptual model. For the SDB, these capabilities are procedurally described by the Operation Types Set in Section 4.2.2 and are of utmost importance since users do not have high level data manipulation operators. Note, however, that changes to the SDB data model should either be extensions or reflect changes in the real world environment [Date, 1977]. Moreover, these changes have to be made only by the DBA.

(cont'd) function". Certainly the answer "no" may unnecessarily restrict the usefulness of the QAS. Perhaps a change in the theorem prover may result in case (i) as an answer rather than case (ii). More research is needed to eliminate case (ii) as the answer to the question.

The  conceptual  model  of  a  general-purpose  database  is 
defined  by  means  of  the  conceptual  schema  which  is 
represented  by  a  specially  provided  language,  perhaps  a  data 
definition  language  [Date,  1977].  The  compiled  form  of  the 
conceptual  schema  is  used  by  the  database  management  system 
and  the  source  form  serves  as  a  reference  document.  For  a 
general-purpose  database,  changes  to  the  conceptual  model 
can  be  achieved  by  re-writing  a  new  conceptual  schema  and 
then compiling it. However, for the SDB schema (defined in Section 4.2.2), (a) structural rules, if any, have to be satisfied (e.g. rule 1 in Section 4.2.3), (b) the danger of compromise due to changes has to be checked, (c) users' additional knowledge has to be recorded in the UKCs and Knowledge Sets, (d) PDCs for new populations have to be constructed, and (e) security constraints related to the changes have to be specified. It is therefore desirable to have a set of high-level operations (or commands) for changes in the SDB schema. These operations are procedurally defined by type-2 operation types in the SDB data model and the Knowledge Base Operation Types Set in Section 4.2.2.

The  job  of  the  SSMF  can  be  extended  to  check  for 
compromise  by  consulting  the  QAS  and  to  process  (c)-(e).  One 
important  advantage  of  these  additions  to  the  SDB  is  to  ease 
the job of the DBA by delegating some of his duties to the
SSMF.

5.1.3  Security  Kernel  for  the  SDB 

Certification  of  the  security  mechanism  of  the  SDB  is 
needed  to  guarantee  that  it  works  correctly.  Since 
certification  by  proving  the  correctness  of  programs  is 
known  to  be  a  difficult  task,  it  is  desirable  to  minimize 
the  software  needed  for  certification. 

Security kernels have been successfully applied in
operating systems in order to improve the reliability of the
system. Recently, Downs and Popek [Downs and Popek, 1977;
Downs and Popek, 1979] have reported a general design for a
secure database management system (DBMS), together with a
case study implementation using the INGRES relational DBMS.
The design supports data security through the use of a kernel
architecture which minimizes and encapsulates the software
upon which correct protection enforcement depends. The main
advantage of a security kernel is that it reduces the amount
of security-relevant code that must be certified.

In statistical databases, a security kernel can be
designed by placing all security-relevant code in separate
certified modules. The security kernel design in Section 5.3
may be considered an extension of [Downs and Popek, 1977] to
statistical databases.

5.2  The  Extended  SDB  Design 

In this section, an SDB design that incorporates the
considerations in Section 5.1 is discussed. To illustrate
the design, the D-A Model is used as the conceptual data
model of the SDB. However, the design is independent of the
choice of data model and may easily be modified for any
other structured, redundant and semantic data model.

Some  users  of  the  SDB  have  statistical  access  to  the 
database  as  well  as  primitive  change  operations  such  as 
retrieval,  insertion,  deletion  or  updates  of  some 
individuals  in  the  SDB.  For  example,  managers  may  change  the 
information  about  employees  in  their  own  department,  or  an 
employee  may  retrieve  and  change  his  own  information  in  the 
SDB. In order to avoid simple direct disclosures, users may
be permitted to insert or delete any individual only once,
and changes are proposed to be revealed to users in pairs,
triplets, etc. (see Chapter 3). It is also assumed that
different views of the data and high-level data manipulation
operators, such as the join and project operators of the
relational model, are not permitted to users for statistical
queries.

Clearly,  letting  users  perform  controlled  insertions, 
deletions  and  updates  makes  the  authorization  and 
enforcement  of  authorization  rules  a  formidable  task. 
However, in the USA, for example, laws concerning privacy
require [Privacy Act, 1974; Davida et al., 1978] that

(1) a means must be provided for an individual to
review his own information in the database and how it is used,

(2) the individual must be provided with a means of
correcting or amending his own identifiable information.

In  the  SDB  design,  it  is  assumed  that  authorization  and 
enforcement  of  authorization  rules  are  achieved  using 
mechanisms similar to one of those in [Hartson and Hsiao, 1976;
Fernandez et al., 1976; Griffiths and Wade, 1976] and will
not be discussed here.

In  the  SDB,  users  and  the  DBA  deal  with  logical 
objects,  and  security  constraints  are  defined  over  logical 
individual  objects  (or  individuals)  in  the  conceptual  data 
model. Since the enforcement of security constraints (and
authorization rules) is most reliably done at access time
on physical objects, a secure mapping between physical and
logical individual objects is needed for a secure SDB. In
[Downs and Popek, 1977] this mapping is achieved by using
tags on each separately protectable data item in the physical
database. In this design the same mechanism is used, and
each property value of an individual in the physical
database is tagged with the individual's logical name and the
logical name of its property.

The  DBA  and  the  SSMF  have  access  to  the  protection  data 
which  is  described  in  Section  5.2.1.  Section  5.2.2  contains 
descriptions of the (extended) SSMF and the QAS. The DBA can
insert, delete or update (properties of) individuals in the
SDB, make changes in the conceptual model and use other
security-related commands to ensure the security of the SDB.
Section 5.2.3 discusses the set of security-related commands
available to the DBA. Section 5.3 discusses how statistical
queries, the insertion, deletion and update queries of users,
and the commands available to the DBA are processed.

5.2.1  The  Protection  Data 

The  protection  data  contains  the  following. 

1) A Knowledge Set, which contains a user group's
additional knowledge of general rules and explicit facts, and
a User Knowledge Construct (UKC) are needed for each user
group. General rules and explicit facts are properly
expressed first-order predicate calculus formulas. The UKC
contains SA-constraints, some global constraints, users'
knowledge of the protected property values in
SA-populations, change sequences and change sequence
parameters. Note that now, unlike in Chapter 4, the UKC may
contain some global constraints to avoid some type S and
type D inferences. The reason for this change is to enable
the DBA to apply different constraints to different users so
that the richness of the SDB may be less restricted.

2) A Population Definition Construct (PDC) for each
population.

3)  Logical  Name  Tables  of  individual  objects  in  the 
SDB.  To  achieve  correct  mapping  between  physical  and  logical 
individual  objects,  the  logical  name  tables  of  individuals 
are  maintained. 

4)  Authorization  Information. 
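
The four components listed above can be pictured as a single record. The following Python sketch is only an illustration under stated assumptions: the class and field names (ProtectionData, UKC, PDC, add_user_group, the sample location key) are hypothetical and are not part of the design, and formulas are kept as plain strings rather than as parsed predicate calculus.

```python
from dataclasses import dataclass, field


@dataclass
class UKC:
    """User Knowledge Construct of one user group (hypothetical layout)."""
    sa_constraints: list = field(default_factory=list)
    global_constraints: list = field(default_factory=list)
    known_values: dict = field(default_factory=dict)      # (individual, property) -> value
    change_sequences: dict = field(default_factory=dict)  # population -> list of changes


@dataclass
class PDC:
    """Population Definition Construct of one population (hypothetical layout)."""
    population: str
    query_types: dict = field(default_factory=dict)       # property -> allowed query types
    constraints: list = field(default_factory=list)


@dataclass
class ProtectionData:
    knowledge_sets: dict = field(default_factory=dict)    # user group -> list of formulas
    ukcs: dict = field(default_factory=dict)               # user group -> UKC
    pdcs: dict = field(default_factory=dict)               # population -> PDC
    logical_names: dict = field(default_factory=dict)      # physical location -> (individual, property)
    authorization: dict = field(default_factory=dict)      # user group -> accessible populations

    def add_user_group(self, group):
        """Each user group needs both a Knowledge Set and a UKC."""
        self.knowledge_sets.setdefault(group, [])
        self.ukcs.setdefault(group, UKC())
        self.authorization.setdefault(group, set())


pd = ProtectionData()
pd.add_user_group("managers")
pd.pdcs["EMPLOYEE"] = PDC("EMPLOYEE", {"SALARY": {"SUM", "MEDIAN", "COUNT"}})
pd.logical_names[("block7", 3)] = ("JOHN-DOE", "SALARY")
```

Keeping the logical name tables next to the constraints reflects the requirement that constraints are stated over logical objects but enforced on physical ones.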

5.2.2  The  (Extended)  SSMF  and  the  QAS 

The  (Extended)  SSMF  consists  of  three  certified 
modules:  the  Query  Controller  (QC),  the  Constraint  Enforcer 
and  Checker  (CEC)  and  the  Conceptual  Model  Modifier  (CMM). 
All  three  modules  of  the  SSMF  have  access  to  the  protection 
data. The Question-Answering System (QAS) is not certified,
but works isolated from users, has only read access to the
protection data and is used only by the SSMF and the DBA.
As in [Downs and Popek, 1977], the security-unrelated
portions of the database management system are separated and
put into another module: the Database Management Module
(DBMM).


Statistical queries of users should specify the
population's name (e.g. EMPLOYEE) and property (e.g.
SALARY), and the statistical query type(s) (e.g. SUM). These
queries are called population-specified queries. A
variation on population-specified queries specifies
the characteristics of the population using conjunctions of
boolean clauses instead of the population's name. These
queries are called characteristic-specified queries.
The functions of the QC include (a) parsing the query and (b)
mapping characteristic-specified populations into existing
populations in the conceptual model.

The function of the CMM is to help the DBA assess
possible inferences when there are changes in the conceptual
model (i.e. when conceptual model modification commands are
used); the CMM uses the QAS to decide about the
security of individuals.

The  CEC  utilizes  the  PDC,  the  UKC  and  the  QAS  to 
enforce  or  to  modify  security  constraints.  The  CEC  also 
helps  the  DBA  in  several  security-related  decision  problems 
by providing lists of individuals whose security is
threatened  under  events  such  as  changes  in  user  groups  or  in 
users'  additional  knowledge. 

The QAS is invoked when (a) users (or the DBA) request
insertion, deletion or updates of individuals, (b) users
gain "some more" knowledge and further inferences are to be
decided, or (c) the DBA wants to change the conceptual model
and the QAS is called through the CMM. In all these cases
the function of the QAS is to decide about the inferred
knowledge.

5.2.3  Conceptual  Model  Modification  and  Security-Related 
Commands 

Conceptual  Model  Modification  commands  are  used  for 
creation,  deletion  or  decomposition  of  populations  and  for 
attribute  deletion  or  addition  to  populations.  It  is  equally 
important  to  have  commands  for  insertions  or  deletions  of 
users'  additional  knowledge  as  well  as  for  changing  security 
constraints  and  users'  allowed  statistical  query  types.  All 
commands can only be used by the DBA. All commands have a
test mode in which the operation is not executed, but
the consequences it would have if executed are reported to
the DBA (e.g. the disclosures that would become possible if
the modification took place). The test mode may be identified
by "*" preceding the command, e.g. *CREATE, *DECOMPOSE, etc. Below
we describe these commands based on the D-A Model. However,
they may be modified slightly for use with other semantic
data models. Note that while processing these commands,
rule (1) for the D-A model (Section 4.2.3) is not checked by
the SSMF, but is confirmed by the DBA.
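
The test mode can be read as a dry run: the command is applied to a copy of the conceptual model, the consequences are reported, and the copy is committed only when the "*" prefix is absent. The sketch below illustrates this under simplifying assumptions. The one-word CREATE/DELETE syntax, the function run_command and the callback find_disclosures (standing in for the CMM and the QAS) are all hypothetical and much simpler than the commands described in this section.

```python
def run_command(text, model, find_disclosures):
    """Execute a DBA command, or only report its consequences in test mode."""
    test_mode = text.startswith("*")        # "*" marks the test mode
    command = text.lstrip("*").strip()
    verb, _, name = command.partition(" ")

    trial = dict(model)                     # work on a copy of the model
    if verb == "CREATE":
        trial[name] = {}
    elif verb == "DELETE":
        trial.pop(name, None)
    else:
        raise ValueError(f"unknown command: {verb}")

    report = find_disclosures(trial)        # consequences are reported in both modes
    if not test_mode:
        model.clear()
        model.update(trial)                 # commit only outside test mode
    return report
```

With this arrangement the DBA receives the same report in both modes, so a test run is a faithful preview of the real execution.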

Population Formation in a New Generalization Plane:

The create population command in Figure 5.1 creates a new
population EMPLOYEE in a new generalization plane.
Population EMPLOYEE has only one property, SALARY, for which
SUM, MEDIAN and COUNT queries are allowed. IND is a binary
relation (previously prepared by the DBA) with one tuple for
each EMPLOYEE individual, i.e. (empname, salary). The EMPLOYEE
object is not an aggregate of other abstract objects (thus
the keyword aggregate of may be deleted).

Security information gives security-related
information. INS-DEL is a relation with one tuple for each
user, i.e. (userid, id-ins, id-del), where id-ins and id-del
describe whether insertions into or deletions from the
EMPLOYEE population are identifiable (that is, whether the
user group can identify the deleted or inserted individual).
SAL-UPDT is a binary relation with a tuple (userid, id-upd)
for each user group, where id-upd describes whether salary
updates are identifiable.

FACT is a binary relation with tuples (userid, fcts),
where fcts describes an explicit fact known by the user
group and not existing in the conceptual model or in the
protection data. Similarly, binary relations INFRNCES and
VSET describe general rules involving the EMPLOYEE
population and SALARY values of EMPLOYEE individuals that
are known to user groups. Users' knowledge of upper and
lower bounds about employees' salaries may also be specified
similarly if deemed necessary.

    create population EMPLOYEE,
        properties SALARY(SUM, MEDIAN, COUNT),
        individuals IND,
        aggregate of,
        security information
            identifiability
                insertion-deletion INS-DEL,
                property-update SALARY(SAL-UPDT),
            knowledge
                facts FACT,
                general rules INFRNCES,
                known-value set SALARY(VSET),
            global constraints,
            changes processed in PAIRS,
        disclosure check EMPLOYEE(SALARY),
    end.

Figure 5.1. The Create Population Command

There are no global constraints for the EMPLOYEE population
(hence the keyword global constraints may be deleted). If
there were, procedure names would be given. Finally, the
command also specifies that changes (i.e. insertions,
deletions and updates) in the EMPLOYEE population are processed
in pairs in order to prevent disclosure.

Execution of the create population command includes the
following.

(1) The conceptual model is modified.

(2) The protection data is modified, i.e. a new PDC is
created, the UKC and knowledge base of each user group are
updated, and the logical name tables of individuals are
modified. (We assume that the authorization information is
specified by other means.)

(3) The CMM, using the QAS, checks for further disclosures
(e.g. the SALARY property values of all EMPLOYEE individuals
in Figure 5.1 are checked), reports any disclosure to the DBA
and records it in the UKC and the Knowledge Set.

Population  Decomposition  in  the  Generalization  Plane: 

The decompose command in Figure 5.2 partitions the
population of systems programmers in Figure 4.5 into three
populations: systems programmers with B.S., M.S. and Ph.D.
degrees. The first part of the decompose command describes
the modifications to the conceptual schema and the second
part gives information about each newly created population.

The decompose command initiates the following
operations.

(1) Modifications to the conceptual schema are made.
Consistency requirements are enforced; for example, all
G-properties (see Section 4.2.3) of a newly created
population must exist in its subpopulations in the
generalization hierarchy.

(2) Properties of each newly created population and
users' knowledge are entered into the protection data.

(3) The CMM, using the QAS, decides about the inferred
knowledge.

Cluster  Deletion  of  a  Population: 

The delete cluster command below removes the cluster
consisting of the populations ASSIGNMENT-IN-DATABASE-PROJECT
and ASSIGNMENT-IN-RELIABILITY-PROJECT in Figure 4.5.

    delete cluster (ASSIGNMENT-IN-DATABASE-PROJECT,
                    ASSIGNMENT-IN-RELIABILITY-PROJECT);

The delete cluster command initiates the following
operations:

(1) Clusters specified in the command are deleted from
the SDB schema.

(2) G-properties of deleted populations are removed
from the remaining subpopulations in the generalization
hierarchy.

(3) The PDC is modified. Global constraints in other
populations involving deleted populations are deleted. All
UKCs are updated.

    decompose
        parent SYSTEMS-PROGRAMMER with cluster
            (SYSTEMS-PROGRAMMER-BS, SYSTEMS-PROGRAMMER-MS,
             SYSTEMS-PROGRAMMER-PHD),
        new population SYSTEMS-PROGRAMMER-BS,
            properties SALARY(SUM, COUNT),
            individuals REL1,
            aggregate of,
            security information
                identifiability
                    insertion-deletion INSERT-DELETE,
                    property update SALARY(UPDATE),
                knowledge,
                global constraints ROUTINE2,
                changes processed in TRIPLETS,
            disclosure check PROGRAMMER(SALARY),
        new population SYSTEMS-PROGRAMMER-MS,

    end.

Figure 5.2. The Decompose Command

    create property SALARY(SUM, COUNT),
        population SECRETARY,
        individuals INDVS,
        security information
            identifiability
                property update SALARY(UPDT),
        disclosure check EMPLOYEE(SALARY),
    end.

Figure 5.3. The Create Property Command

Insertion and Deletion of G-properties:

For consistency reasons, properties inserted into a
population can only be G-properties and, thus, must be
properties of all its sub-populations in the generalization
hierarchy. The create property command in Figure 5.3 adds
the  property  SALARY  to  the  population  SECRETARY.  Notice  that 
the  create  property  command  attaches  the  same  statistical 
query  types  to  each  sub-population.  If  this  is  not 
desirable,  the  change  command  (described  below)  may  be  used 
to  make  modifications. 

In  addition  to  the  SDB  data  model  and  protection  data 
modifications,  the  QAS  is  also  executed  to  decide  further 
disclosures  (similar  to  the  create  population  or  the 
decompose  commands). 

For consistency, only G-properties may be deleted, and they
are deleted from all sub-populations having them. An example is

    delete attribute SALARY,
        population EMPLOYEE;

As with the delete cluster command, the conceptual schema and
protection data are updated as a result of the delete
attribute command.

Changing  the  User  Group's  Knowledge  in  a  Knowledge  Set: 

General rules or explicit facts may be inserted or
deleted using the commands described below.

(1) add general rules INFERS
        disclosure check EMPLOYEE(SALARY), ENGINEER(DEGREE);

(2) add facts FACT,
        disclosure check EMPLOYEE(SALARY),
        individual JOHN-DOE(DEGREE);

(3) delete general rules INFERS;

(4) delete facts FACT;

INFERS in (1) and (3) is a binary relation with tuples
(userid, inf-rule), where inf-rule describes a general rule.
Similarly, FACT in (2) and (4) is a binary relation with
tuples (userid, fcts), where fcts describes an explicit fact.
In (2), the DEGREE of JOHN DOE is checked for disclosure as
well as the SALARY properties of all individuals in the
EMPLOYEE population.

Changing  Constraints  and  Other  Security-Related  Information: 

The add constraint and the delete constraint commands
may be used to insert or remove constraints from
populations, for example,

    add constraint (ROUTINE33, ASSIGNMENT(COUNT));
    delete constraint (R31, EMPLOYEE(PAY-RATE));

ROUTINE33 and R31 are the names of modules which belong to
the CEC. Notice that all these constraints are inserted into
the PDC of the related population. Constraints may also be
inserted into the UKC of specific user groups similarly.
Finally, the change command may be used to modify the
security-related information in the PDCs or UKCs. An example
is

    change population EMPLOYEE,
        identifiability INS-DEL,
        statistical query types SALARY(SUM, MEAN),
        changes processed in PAIRS,
    end.

5.3  Processing  Queries  and  Commands 

In  this  section  we  describe  how  statistical  queries  and 
security-related  commands  are  processed.  Note  that  the  DBA 
may  be  equipped  with  high-level  data  manipulation  operators 
(such as the join, project or select operators of the
relational model). However, queries involving these operators
may be processed as described in [Downs and Popek, 1977] and
will not be discussed here.

Statistical  Queries: 

Figure 5.4 shows the steps in processing statistical
queries. The queries may be characteristic-specified or
population-specified. The QC parses the statistical
retrieval request and retains the request type (SUM, MAX,
MEAN, etc.) and the logical name for population-specified
queries.

Figure 5.4. Retrieval Operation for Statistical Queries

For characteristic-specified statistical queries, the QC
checks whether the specified characteristics describe an
existing population in the D-A model. If so, the name of the
population and the query type are retained; otherwise the
query is rejected.
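
The two query forms can be resolved to a logical population name in one step. The Python sketch below is a minimal illustration, assuming that each population in the conceptual model is defined by a set of (property, value) clauses. The table POPULATIONS, its contents and the function resolve_population are hypothetical; only an exact match with an existing population is accepted, as required above.

```python
from typing import Optional

# Hypothetical conceptual model: population name -> defining characteristics.
POPULATIONS = {
    "SECRETARY": frozenset({("JOB", "SECRETARY")}),
    "SYSTEMS-PROGRAMMER": frozenset({("JOB", "SYSTEMS-PROGRAMMER")}),
    "SYSTEMS-PROGRAMMER-MS": frozenset(
        {("JOB", "SYSTEMS-PROGRAMMER"), ("DEGREE", "MS")}
    ),
}


def resolve_population(query: dict) -> Optional[str]:
    """Return the logical population name for a query, or None to reject it."""
    if "population" in query:                      # population-specified query
        name = query["population"]
        return name if name in POPULATIONS else None

    wanted = frozenset(query["characteristics"])   # characteristic-specified query
    for name, definition in POPULATIONS.items():
        if definition == wanted:                   # must describe an existing population
            return name
    return None                                    # no such population: reject
```

Rejecting every characteristic set that does not coincide with a defined population is what prevents users from forming arbitrary query sets.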

The QC then passes to the Main Kernel the type of the
query, the logical population name and the user's
identification supplied by the operating system. At the same
time the request is also sent to the DBMM. The DBMM
functions like a normal database management system: access
paths and access methods are decided, performance and other
statistics are recorded, etc. However, the DBMM is not
allowed direct access to the database; it prepares and
passes to the Main Kernel a read command specifying the
physical locations that are to be accessed.

Accesses to the physical database are made only by the
certified Main Kernel. The Main Kernel consists of I/O and
authorization modules and the CEC. Before allowing access to
the information in the database, the Main Kernel checks the
protection data's authorization information to see if the
user is allowed to access the particular population. The
request from the DBMM includes physical location parameters
and logical individual names. (Both pieces of information are
verified by the Main Kernel after the accesses are made.)
Since the protection data identifies individuals and
populations by their names in logical name tables, the Main
Kernel first accesses the physical property values, matches
them with the logical individual names and the related
property specified by the request from the QC, and thereby
verifies the correctness of the physically retrieved data.
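
The verification step can be sketched as follows. This is an illustration only: the store PHYSICAL, its layout (each stored value carries the logical name of its individual and of its property as tags) and the function kernel_read are hypothetical, and the authorization check that precedes the access is omitted.

```python
# Hypothetical physical store: location -> (individual tag, property tag, value).
PHYSICAL = {
    101: ("JOHN-DOE", "SALARY", 30000),
    102: ("MARY-ROE", "SALARY", 34000),
    103: ("JOHN-DOE", "DEGREE", "MS"),
}


def kernel_read(locations, expected_individuals, expected_property):
    """Read the locations named by the DBMM and verify their tags
    against the logical request received from the QC."""
    values = {}
    for loc in locations:
        individual, prop, value = PHYSICAL[loc]
        if prop != expected_property or individual not in expected_individuals:
            raise PermissionError(
                f"location {loc} does not match the logical request"
            )
        values[individual] = value
    if set(values) != set(expected_individuals):
        raise PermissionError("some requested individuals were not retrieved")
    return values
```

Because the check uses only the tags and the logical request, a faulty or malicious DBMM that supplies wrong locations is detected before any value reaches the CEC.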

Once  the  data  is  retrieved,  the  CEC  in  the  Main  Kernel 
enforces  the  security  constraints  and  returns  the  result  to 
the  user  directly. 

In the above retrieval process, all the
security-unrelated functions of the database are separated
from the security-related functions. Although the DBMM makes
optimization decisions (access path finding, etc.), it has
no authority to change the conceptual model, the protection
data or the physical database. In fact, the DBMM is not
aware of the existence of the protection data. Two
security-related modules, the QC and the Main Kernel, are
certified and, hence, assured of secure operation. The
protection data is accessed only by kernel modules during
the statistical retrieval operation.

User (or DBA)-Requested Insertion, Deletion or Update of
Individuals:

Figure 5.5 shows the execution of a change operation.

Up to the Main Kernel, the processes are similar to the
execution of statistical queries. If the change is the
insertion of an individual, the CEC checks (say, by scanning
the change sequence) that this is the only
insertion of the individual. The change (say, insertion of
individual x into an SA-population) is recorded in the
related change sequence of the user group requesting the
change. Then for each user group u and for each property
value of x and of the other individuals (if any) to be
processed with x, the CEC sends the proper questions (see
Section 5.1.1) to the QAS (with the temporary inclusion of the
knowledge of processing x and the other individuals in the
Knowledge Set of u). If the QAS confirms that there is no
disclosure, then the change is processed for the user group u
(i.e. it is revealed to the users in user group u and
recorded in the related change sequence of u). Otherwise
the change is not processed and the change information is
stored as an SA-constraint in the UKC of the user group u.
The constraint and the possible disclosure are also reported
to the DBA.
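
The control flow of the paragraph above can be summarized in a short Python sketch. It is a simplification under stated assumptions: the function process_insertion and its arguments are hypothetical, the QAS is replaced by a callback qas_discloses(u, x) that answers whether processing x would disclose protected information to user group u, and the grouping of changes into pairs or triplets is not modelled.

```python
def process_insertion(x, requester, groups, change_seq, ukc, qas_discloses, report):
    """Handle the insertion of individual x requested by one user group."""
    # An individual may be inserted only once (checked by scanning the sequences).
    if any(("insert", x) in seq for seq in change_seq.values()):
        raise ValueError(f"{x} has already been inserted once")

    # Record the change for the requesting user group.
    change_seq[requester].append(("insert", x))

    for u in groups:
        if qas_discloses(u, x):
            # Not processed for u: keep it as an SA-constraint and tell the DBA.
            ukc[u].append(("SA-constraint", "insert", x))
            report.append((u, x))
        elif u != requester:
            # Processed for u: revealed to u and recorded in its change sequence.
            change_seq[u].append(("insert", x))
```

Here report plays the role of the message to the DBA; in the design itself the QAS would be consulted once per property value rather than once per individual.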

Normally, the change immediately takes place in the
physical database and the conceptual model. However, some
security constraints may exist requiring the change not to
be reflected in the answers to the statistical queries.

Figure 5.5. User or DBA-Requested Change Operation

DBA-Requested  Conceptual  Model  Changes: 

In  Section  5.2.3,  six  conceptual  model  modification 
commands  have  been  identified:  create  population,  decompose 
population,  delete  cluster,  change  population,  create 
property  and  delete  property  commands.  The  certified  CMM 
controls  the  execution  of  these  commands.  The  general 
functions  of  the  CMM  are  given  below. 

(a) The CMM keeps the conceptual model consistent during
changes and ensures that the security-related rules are satisfied.

(b)  The  CMM  prepares  the  input  to  the  QAS  and  the  output 
to  the  DBA  about  the  disclosures  due  to  the  proposed 
conceptual  model  changes.  The  populations  and  their 
properties  specified  by  the  disclosure  check  keyword  in  the 
command  identify  the  scope  of  search  for  disclosure. 

(c)  The  CMM  updates  the  protection  data  as  specified  by 
the  command.  New  individual  objects  are  recorded  into  the 
logical  name  tables.  The  CMM  also  enters  disclosures  found 
by  the  QAS  into  the  UKC  and  the  Knowledge  Set  of  the  user 
group. 

(d) The CMM delegates the task of making changes in the
conceptual schema, external/conceptual mappings,
conceptual/physical mappings and the data dictionary to the
DBMM since they are not security-related.

(e) Physical database changes requested by the DBA (such
as new individuals, new attributes, etc.) are passed on to
the Main Kernel by the CMM.

Figure  5.6  shows  a  DBA-requested  conceptual  model 
change  operation.  The  QC  parses  the  query,  converts  the 
explicit  facts  and  general  rules  into  an  internal  form  and 
sends  the  request  type,  changes  and  other  security-related 
information  to  the  CMM.  The  CMM  makes  consistency  checks, 
detects  further  disclosure  by  communicating  with  the  QAS  and 
reports  it  to  the  DBA,  delegates  the  task  of  making 
conceptual  model  changes  to  the  DBMM,  updates  the  protection 
data  and  sends  the  physical  data  changes  to  the  Main  Kernel 
and  the  DBMM. 

The  DBMM  decides  about  access  paths  and  access  methods, 
prepares  necessary  locking  information  of  the  physical 
database,  etc.,  and  sends  the  I/O  request  together  with 
physical  locations  to  the  Main  Kernel.  The  Main  Kernel 
retrieves  property  values  of  individuals,  compares 
individuals for correct logical-physical mapping, ensures
correctness  of  the  operation,  makes  the  changes  and  informs 
the  DBA  about  the  completion  of  the  operation. 

Figure 5.6. DBA-Requested Conceptual Model Change Operation

DBA-Requested Security-Related Commands:

Section 5.2.3 lists six security-related commands,
namely, addition and deletion of general rules, explicit
facts and security constraints. All of these commands can
only be used by the DBA and affect only the protection data.
The execution of these commands is similar to that of the
DBA-requested change operations except that there are no
physical database changes and thus the DBMM and the Main
Kernel are not involved. Figure 5.7 shows the execution of a
DBA-requested security-related command.

Figure 5.7. Execution of a DBA-Requested, Security-Related
Command

5.4 Discussion

The SDB design proposed in this chapter increases the
effectiveness of the protection scheme and the richness of
the SDB. However, the efficiency of protection may degrade for
two reasons.

(1) The QAS confirms the security of information by
showing that the information is not deducible. As has
been pointed out to us [Schubert, 1979], the best way to
show that something is not provable may not be to try to
prove it. Clearly, confirming the security of secure
information may become inefficient if the whole search space
has to be searched. Improvement regarding this problem is
possible when (a) the search space is very small, (b)
irrelevant data or general rules are avoided during the
deductive process, and (c) the deductive search is aided by
semantic information. Another approach which may improve the
efficiency of the SDB is to compile the intensional part of
each user group's Knowledge Set once, using a suitably
designed interactive theorem prover [Reiter, 1979]. One of
the advantages of this approach is that it eliminates the
need for a theorem prover at query evaluation time.

(2) Storage requirements of the protection data may be
too high when there are several Knowledge Sets with
duplicate information. A solution may be to keep a single
Knowledge Set with additional information as to "who" knows
"what".

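One way to picture a single Knowledge Set that records who knows what is to store each formula once and annotate it with the user groups that know it. The Python sketch below is only an illustration of that idea; the class SharedKnowledgeSet and its methods are hypothetical, and formulas are kept as plain strings.

```python
from collections import defaultdict


class SharedKnowledgeSet:
    """One store of formulas, each annotated with the user groups that know it."""

    def __init__(self):
        self._known_by = defaultdict(set)   # formula -> set of user groups

    def add(self, group, formula):
        self._known_by[formula].add(group)

    def remove(self, group, formula):
        groups = self._known_by.get(formula)
        if groups is None:
            return
        groups.discard(group)
        if not groups:
            del self._known_by[formula]      # no group knows it any more

    def view(self, group):
        """The Knowledge Set of one user group, reconstructed on demand."""
        return {f for f, groups in self._known_by.items() if group in groups}

    def stored_formulas(self):
        return len(self._known_by)
```

A formula known to several user groups is stored once, at the cost of filtering by group whenever the QAS needs the knowledge of one particular group.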

CHAPTER  6 


CONCLUSIONS 

In  this  thesis  the  security  of  statistical  databases  is 
investigated  in  the  context  of  a  statistical  database 
design.  The  importance  of  semantic  meaningfulness  of  users' 
queries  is  stressed.  It  is  argued  that  this  will  enhance  the 
security  by  not  permitting  malicious  users  to  form 
meaningless  queries  in  order  to  use  their  responses  in 
combinatorial  formulas  for  compromise.  A  natural  extension 
of  this  argument  and  others  led  to  the  usage  of  semantic, 
redundant  and  structural  conceptual  data  models  in  the 
design  of  a  statistical  database.  New  results  involving  a 
comprehensive  secure  SDB  design  have  been  described  in  this 
thesis.

The  partitioning  model  is  discussed  in  order  to  be  used 
as  a  tool  in  the  SDB  design.  Primitive  change  operations  are 
allowed  in  the  model,  and  the  conditions  are  derived  to 
prevent  compromise.  Variations  of  the  partitioning  model 
which use rounding, data perturbation, or both are
introduced to remove some of the assumptions made in the
partitioning model.

Within the context of a formal framework, an SDB design
using  security  constraints  at  the  conceptual  data  model 
level  is  proposed.  Three  different  structural,  semantic  and 
redundant  data  models  are  investigated  and  the  D-A  model 
[Smith  and  Smith,  1977]  is  chosen  as  a  conceptual  data  model 
of  the  SDB.  The  population  concept  is  utilized  to  identify 
semantically  well-defined  objects  about  which  statistical 
information  is  revealed  to  users.  For  this  purpose,  a  simple 
construct  called  the  Population  Definition  Construct  (PDC) 
is  introduced  for  each  population  in  the  conceptual  model. 


It  is  argued  that,  for  complete  protection,  users' 
additional  knowledge  should  be  maintained  and  kept  up-to- 
date.  Users'  additional  knowledge  may  take  the  form  of 
general  rules  and  explicit  facts.  The  SDB  design  proposed 
herein  maintains  users'  knowledge  of  only  protected  property 
values  of  individuals  in  the  SDB,  and  this  is  implemented 
using a simple construct called the User Knowledge Construct
(UKC).


In  order  to  keep  the  PDCs  and  UKCs  up-to-date,  to 
enforce  the  security  constraints  and  to  help  the  DBA  in 
security-related  decision  problems,  the  constraint  enforcer 
and  checker  (CEC)  is  proposed.  The  CEC,  UKCs  and  PDCs 
comprise  the  Statistical  Security  Management  Facility 
(SSMF). Implementation issues of the SSMF are briefly
discussed.

A novel property of the SDB design in Chapter 4 is that
it  can  be  "added  on"  to  an  existing  general-purpose  database 
system  without  major  modifications  to  its  DBMS.  This  feature 
may  be  useful  if  the  general-purpose  database  has 
confidential  information  about  which  statistical  queries  are 
permitted.  In  such  a  case,  the  SSMF  is  added  to  the  existing 
DBMS  and  security  constraints  are  enforced  for  each 
statistical  user  query. 

Different  types  of  inferences  by  users  are  identified, 
and  possible  security  constraints  for  different  types  of 
statistical queries are investigated. It is demonstrated
that simple security constraints can usually be defined to
protect the SDB from compromise.

Extensions  to  the  SDB  design  are  described  in  order  to 
increase  effectiveness  of  the  protection  and  the  richness  of 
the SDB. A Question-Answering System with deductive
inference  mechanisms  is  proposed  for  deciding  the  inferred 
knowledge.  For  each  user  group,  a  Knowledge  Set  which  keeps 
the  user  group's  additional  knowledge  of  general  rules  and 
explicit facts relevant to the application is proposed. A
set of security-related commands for changes in the
conceptual model is proposed (for the D-A model) in order
to give the SDB the capability to reflect changes in the
real world. It is argued that a security kernel architecture
will enhance the security of the SDB. Finally, an SDB design
that includes these extensions is discussed. Problems with
the  efficiency  of  the  design  are  summarized. 

There are several avenues of research related to our
approach to SDB design. One direction is to find security
measures, at least semi-automated, that help the DBA assess
how secure the SDB is. In this thesis, using the
information graph, two very simple security measures are
defined. However, both of these measures are very crude and
apply only to SUM and COUNT queries. Security measures,
possibly probabilistic, are needed for all types of
statistical queries. Another research area is the deductive
components of the QAS. Clearly more research is needed there
to make the SDB design in Chapter 5 efficient. Finally, the
proposed  SDB  design  has  yet  to  be  implemented  and  tested  in 
a  real-world  application.  This  may  shed  new  light  on  the 
problem  of  secure  SDB  design. 

The SDB design in this thesis was motivated by the
premise of supplying the SDB users with only what they need,
i.e., as a prerequisite, semantically meaningful
information. It seems to the author that further
categorization of the needs of users may dictate a
hierarchically designed SDB with totally different atomic
information units, security constraints and constructs at
each level of the hierarchy. Consider, for example, a large
company with its executives, managers, engineers, etc.
Clearly the statistical information needed by engineers for
research purposes is quite different from the statistical
information needed by its executives for decision-making
purposes.

In this thesis, output and data perturbation techniques
have been used only to remove some deficiencies of the
partitioning model. These protection techniques and sampling
may be very useful in those cases where SDB users' needs are
very diverse or the implementation of a conceptual model and
the SSMF is not feasible. There are promising research
results about the protection techniques of data perturbation
and random sampling [Beck, 1979; Denning, 1979]; however,
our understanding of the efficiency, effectiveness and
usefulness of these techniques is yet to be furthered.
Another research direction may be to investigate auditing.
Clearly, one measure of the SDB security problem is the
number of queries that the user has asked, i.e., the more
queries, the greater the danger of compromise. Fast auditing
algorithms under different SDB models may be yet another
research direction.
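
A minimal sketch of the simplest form of auditing mentioned above, counting the queries answered for each user, is given below in Python. The class name, the per-user budget `max_queries` and the policy of flagging a user once the budget is exceeded are assumptions made for illustration; the thesis does not prescribe a particular threshold or auditing algorithm.

```python
from collections import Counter


class QueryCountAuditor:
    """Counts answered queries per user and flags users who exceed a budget."""

    def __init__(self, max_queries):
        self.max_queries = max_queries   # hypothetical per-user query budget
        self.counts = Counter()          # user -> number of queries recorded

    def record(self, user):
        """Log one query for `user`; return True while the user is within budget."""
        self.counts[user] += 1
        return self.counts[user] <= self.max_queries


# Hypothetical usage: the third query of the same user exceeds a budget of two.
auditor = QueryCountAuditor(max_queries=2)
within_budget = [auditor.record("user_1") for _ in range(3)]
```

A query count is only a crude proxy for the danger of compromise; an auditing algorithm for a specific SDB model would have to examine what the answered queries jointly reveal.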


Appendix A


Proof of Lemma in Section 3.2.3: Let $b' = (b-1)/2$ and
$b'' = 1 + (b-1)/2$. We will first show that
$p_R(j) = p_R(b-j)$, $1 \le j \le (b-1)/2$, where $p_R$ is the
probability function for the r.v. $R$. Since the $V$'s and $T$
are symmetrically distributed, from (2) and (3) in Section
3.2.3, $S$ is also symmetrically distributed about the mean
$M_S = M_V M_Z$, and its variance is

$$\mathrm{Var}(S) = \mathrm{Var}(T) + M_Z\,\mathrm{Var}(V) + \mathrm{Var}(Z)\,M_V^2 .$$

Clearly, for $1 \le j \le (b-1)/2$,

$$p_R(j) = \sum_i p_S(ib+j) \quad\text{and}\quad p_R(b-j) = \sum_i p_S(ib-j).^{\dagger}$$

Since $M_S = kb$, for $-k \le i \le k$, $1 \le j \le (b-1)/2$, we have

$$p_S((k-i)b+j) = p_S((k+i)b-j) \tag{1}$$

and for $i = -1, -2, \ldots$ and $1 \le j \le (b-1)/2$, we have

$$p_S(ib+j) = p_S((2k+|i|)b-j) \tag{2}$$

and

$$p_S((2k+|i|)b+j) = p_S(ib-j). \tag{3}$$

Addition of (1), (2) and (3) gives

$$\sum_i p_S(ib+j) = \sum_i p_S(ib-j)$$

or

$$p_R(j) = p_R(b-j). \tag{4}$$

†The notation $\sum_y$ denotes summation over all possible
integer $y$ values.
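
Relation (4) can be checked numerically on a small example. The Python sketch below assumes a hypothetical triangular distribution for S, symmetric about the mean kb, and obtains p_R by folding p_S modulo b. The function names and the parameter values (b = 5, k = 2, half-width 13) are illustrative choices and are not taken from Section 3.2.3.

```python
from fractions import Fraction


def symmetric_pmf(mean, half_width):
    """Triangular pmf on the integers, symmetric about `mean` (hypothetical S)."""
    weights = {mean + d: half_width + 1 - abs(d)
               for d in range(-half_width, half_width + 1)}
    total = sum(weights.values())
    return {s: Fraction(w, total) for s, w in weights.items()}


def residue_pmf(p_s, b):
    """p_R(j) = sum over i of p_S(i*b + j), i.e. the pmf of R = S mod b."""
    p_r = [Fraction(0)] * b
    for s, p in p_s.items():
        # Python's % yields a value in 0..b-1 even when s is negative.
        p_r[s % b] += p
    return p_r


b, k = 5, 2
p_s = symmetric_pmf(mean=k * b, half_width=13)   # support runs from -3 to 23
p_r = residue_pmf(p_s, b)

# Relation (4): p_R(j) == p_R(b - j) for 1 <= j <= (b - 1) / 2.
symmetric = all(p_r[j] == p_r[b - j] for j in range(1, (b - 1) // 2 + 1))
print(symmetric)  # prints True
```

Exact rational arithmetic is used so that the equality in (4) is tested without floating-point tolerance.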
To show (a):


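Since

$$\mathrm{Prob}(W=w) = \sum_{r=0}^{b-1} \mathrm{Prob}(W=w \mid R=r)\, p_R(r),$$

we have

$$E(W) = \sum_{r=0}^{b'} p_R(r)\Big[\sum_w w\, p_T(w+r)\Big] + \sum_{r=b''}^{b-1} p_R(r)\Big[\sum_w w\, p_T(w+r-b)\Big]$$

$$= \sum_{r=1}^{b'} p_R(r)\Big[\sum_t (t-r)\, p_T(t)\Big] + \sum_{r=b''}^{b-1} p_R(r)\Big[\sum_t (t+b-r)\, p_T(t)\Big]$$

(substituting $t = w+r$ and $t = w+r-b$, respectively; the $r = 0$
term vanishes because $E(T) = 0$)

$$= \sum_{r=b''}^{b-1} (b-r)\, p_R(r) - \sum_{r=1}^{b'} r\, p_R(r) = \sum_{r=1}^{b'} r\,[p_R(b-r) - p_R(r)]$$

$$= 0 \quad \text{due to (4).}$$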
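
The cancellation that gives E(W) = 0 can be checked on a small example. The Python sketch below assumes a hypothetical p_R satisfying (4) and a hypothetical p_T symmetric about 0, and evaluates the two sums of the derivation exactly with rational arithmetic. It takes over the conditional form used in the derivation, namely W = T - r for r <= b' and W = T + b - r for r >= b''; the function name and the numerical values are illustrative only.

```python
from fractions import Fraction


def expected_w(p_r, p_t, b):
    """E(W), with W = T - r for r <= b' and W = T + b - r for r >= b''."""
    b_prime = (b - 1) // 2
    total = Fraction(0)
    for r, pr in enumerate(p_r):
        shift = -r if r <= b_prime else b - r
        total += pr * sum((t + shift) * pt for t, pt in p_t.items())
    return total


b = 5
# Hypothetical p_R(0..4) satisfying (4): p_R(1) == p_R(4) and p_R(2) == p_R(3).
p_r = [Fraction(2, 10), Fraction(1, 10), Fraction(3, 10),
       Fraction(3, 10), Fraction(1, 10)]
# Hypothetical p_T, symmetric about 0, so that E(T) = 0.
p_t = {-2: Fraction(1, 8), -1: Fraction(2, 8), 0: Fraction(2, 8),
       1: Fraction(2, 8), 2: Fraction(1, 8)}

print(expected_w(p_r, p_t, b))  # prints 0
```

The contributions of r and b - r cancel pairwise, which is the role played by (4) in the last step of the derivation.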


To show (b):

$$\mathrm{Var}(W) = E(W^2) = \sum_w w^2\, p_W(w) = \sum_{r=0}^{b'} p_R(r)\Big[\sum_t (t-r)^2\, p_T(t)\Big] + \sum_{r=b''}^{b-1} p_R(r)\Big[\sum_t (t+b-r)^2\, p_T(t)\Big]$$

$$= \sum_{r=0}^{b'} p_R(r)\,(\mathrm{Var}(T) + r^2) + \sum_{r=b''}^{b-1} p_R(r)\,[\mathrm{Var}(T) + (b-r)^2]$$

$$= \mathrm{Var}(T) + \sum_{r=0}^{b'} r^2\, p_R(r) + \sum_{r=b''}^{b-1} p_R(r)\,(b^2 - 2br + r^2)$$

$$= \mathrm{Var}(T) + E(R^2) + b \sum_{r=b''}^{b-1} p_R(r)\,(b-2r)$$

$$= \mathrm{Var}(T) + E(R^2) - b \sum_{r=1}^{b'} p_R(r)\,(b-2r)$$

$$= \mathrm{Var}(T)$$

since

$$E(R^2) = \sum_{r=1}^{b-1} r^2\, p_R(r)$$

and, using (4), we have

$$E(R^2) = \sum_{r=1}^{b'} (r^2 + (b-r)^2)\, p_R(r) = b \sum_{r=1}^{b'} p_R(r)\,(b-2r). \qquad \#$$


Appendix B


Usage of Dummy Records in the Partitioning Model:

(1) Assume a path has two active vertices. Then, while
inactivating the active vertices (during their deletion),
one may use two dummy records and then force the path into
a cycle by deleting the two dummy records.


(2) Assume a path has a single active vertex. Then, while
inactivating the active vertex (during its deletion), one
may introduce a new active dummy record, $dr_i$, and later
on, when there are sufficient (e.g. t) dummy records like
$dr_i$, they may be deleted from the partition altogether.


(3) Assume $r_a$ and $r_b$ are requested to be in different
reachability sets. Assume also that the vertex $r_a$ is
already in the information graph and vertex $r_b$ is to be
formed. If vertex $r_b$ is in danger of being connected to
vertex $r_a$, then the system uses a dummy record $dr_i$ to
create a new path with vertices $r_b$ and $dr_i$; afterwards,
either by checking at each change operation or by making one
of the paths that $r_a$ and $r_b$ belong to inactive, the
system may keep the two paths separate.


Some of the updated records may run the danger of being
traced by the user. When this happens, there is a
possibility of an odd cycle since the information graphs of
two different partitions are connected. When the usage of
dummy records is allowed, a procedure similar to (3)
described above can be used to prevent the formation of odd
cycles for "traceable" update operations: if two records
$r_a$, $r_b$ that are in the same reachability set of
partition $p_i$ are moved to another partition $p_j$, then
they may be kept in two disconnected paths using (3)
described above.




